License: CC BY 4.0
arXiv:2403.18103v1 [cs.LG] 26 Mar 2024

Tutorial on Diffusion Models for Imaging and Vision

Stanley Chan¹
¹School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907. Email: stanchan@purdue.edu.

March 26, 2024

Abstract. The astonishing growth of generative tools in recent years has empowered many exciting applications in text-to-image generation and text-to-video generation. The underlying principle behind these generative tools is the concept of diffusion, a particular sampling mechanism that has overcome some longstanding shortcomings of previous approaches. The goal of this tutorial is to discuss the essential ideas underlying diffusion models. The target audience includes undergraduate and graduate students who are interested in doing research on diffusion models or applying these models to solve other problems.

1 The Basics: Variational Auto-Encoder (VAE)

1.1 VAE Setting

A long time ago, in a galaxy far far away, we wanted to build a generator that synthesizes images from latent codes. The simplest (and perhaps the most classical) approach is to consider the encoder-decoder pair shown below. This is called a variational autoencoder (VAE) [1, 2, 3].

[Uncaptioned image]

An autoencoder has an input variable $\mathbf{x}$ and a latent variable $\mathbf{z}$. For the sake of understanding the subject, we treat $\mathbf{x}$ as a beautiful image and $\mathbf{z}$ as some kind of vector living in a high-dimensional space.

Example. Getting a latent representation of an image is not an alien thing. Back in the time of JPEG compression (which is arguably a dinosaur), we used the discrete cosine transform (DCT) basis $\boldsymbol{\varphi}_n$ to encode the underlying image or patches of an image. The coefficient vector $\mathbf{z} = [z_1, \ldots, z_N]^T$ is obtained by projecting the patch $\mathbf{x}$ onto the space spanned by the basis: $z_n = \langle \boldsymbol{\varphi}_n, \mathbf{x} \rangle$. So, if you give us an image $\mathbf{x}$, we will return you a coefficient vector $\mathbf{z}$. From $\mathbf{z}$ we can apply the inverse transform to recover (i.e., decode) the image. Therefore, the coefficient vector $\mathbf{z}$ is the latent code. The encoder is the DCT transform, and the decoder is the inverse DCT transform.
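
To make the analogy concrete, here is a minimal numerical sketch of the DCT encoder-decoder pair; the 8x8 patch size and the use of scipy are our own choices, not part of the original example:

```python
# The encoder is the DCT, the latent code is the coefficient vector z,
# and the decoder is the inverse DCT.
import numpy as np
from scipy.fft import dctn, idctn

patch = np.random.rand(8, 8)        # stand-in for an 8x8 image patch x

z = dctn(patch, norm="ortho")       # encode: z_n = <phi_n, x>
x_hat = idctn(z, norm="ortho")      # decode: inverse transform

print(np.allclose(patch, x_hat))    # True: z losslessly represents x
```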

The name "variational" comes from the fact that we use probability distributions to describe $\mathbf{x}$ and $\mathbf{z}$. Instead of resorting to a deterministic procedure for converting $\mathbf{x}$ to $\mathbf{z}$, we are more interested in ensuring that the distribution $p(\mathbf{x})$ can be mapped to a desired distribution $p(\mathbf{z})$, and mapped back to $p(\mathbf{x})$. Because of this distributional setting, we need to consider a few distributions.

  • $p(\mathbf{x})$: The distribution of $\mathbf{x}$. It is never known. If we knew it, we would have become billionaires. The whole galaxy of diffusion models is to find ways to draw samples from $p(\mathbf{x})$.

  • $p(\mathbf{z})$: The distribution of the latent variable. Because we are all lazy, let's just make it a zero-mean unit-variance Gaussian $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$.

  • $p(\mathbf{z}|\mathbf{x})$: The conditional distribution associated with the encoder, which tells us the likelihood of $\mathbf{z}$ when given $\mathbf{x}$. We have no access to it. $p(\mathbf{z}|\mathbf{x})$ itself is not the encoder, but the encoder has to do something so that it will behave consistently with $p(\mathbf{z}|\mathbf{x})$.

  • $p(\mathbf{x}|\mathbf{z})$: The conditional distribution associated with the decoder, which tells us the posterior probability of getting $\mathbf{x}$ given $\mathbf{z}$. Again, we have no access to it.

The four distributions above are not too mysterious. Here is a somewhat trivial but educational example that can illustrate the idea.

Example. Consider a random variable $\mathbf{X}$ distributed according to a Gaussian mixture model, with a latent variable $z \in \{1,\ldots,K\}$ denoting the cluster identity such that $p_Z(k) = \mathbb{P}[Z = k] = \pi_k$ for $k = 1,\ldots,K$. We assume $\sum_{k=1}^{K} \pi_k = 1$. Then, if we are told that we need to look at the $k$-th cluster only, the conditional distribution of $\mathbf{X}$ given $Z$ is

$$p_{\mathbf{X}|Z}(\mathbf{x}\,|\,k) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \sigma_k^2 \mathbf{I}).$$

The marginal distribution of $\mathbf{x}$ can be found using the law of total probability, giving us

$$p_{\mathbf{X}}(\mathbf{x}) = \sum_{k=1}^{K} p_{\mathbf{X}|Z}(\mathbf{x}\,|\,k)\, p_Z(k) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \sigma_k^2 \mathbf{I}). \tag{1}$$

Therefore, if we start with $p_{\mathbf{X}}(\mathbf{x})$, the design question for the encoder is to build a magical encoder such that for every sample $\mathbf{x} \sim p_{\mathbf{X}}(\mathbf{x})$, the latent code will be $z \in \{1,\ldots,K\}$ with distribution $z \sim p_Z(k)$.

To illustrate how the encoder and decoder work, let's assume that the means and variances are known and fixed; otherwise we would need to estimate them through an EM algorithm. It is doable, but the tedious equations would defeat the purpose of this illustration.

Encoder: How do we obtain $z$ from $\mathbf{x}$? This is easy because at the encoder we know $p_{\mathbf{X}}(\mathbf{x})$ and $p_Z(k)$. Imagine that you only have two classes $z \in \{1,2\}$. Effectively you are just making a binary decision of where the sample $\mathbf{x}$ should belong. There are many ways to make this binary decision. If you like maximum-a-posteriori, you can check

$$p_{Z|\mathbf{X}}(1\,|\,\mathbf{x}) \;\underset{\text{class 2}}{\overset{\text{class 1}}{\gtrless}}\; p_{Z|\mathbf{X}}(2\,|\,\mathbf{x}),$$

and this will return you a simple decision rule: you give us $\mathbf{x}$, we tell you $z \in \{1,2\}$.

Decoder: On the decoder side, if we are given a latent code $z \in \{1,\ldots,K\}$, the magical decoder just needs to return us a sample $\mathbf{x}$ drawn from $p_{\mathbf{X}|Z}(\mathbf{x}\,|\,k) = \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}_k, \sigma_k^2 \mathbf{I})$. A different $z$ will give us one of the $K$ mixture components. If we have enough samples, the overall distribution will follow the Gaussian mixture.
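
The following toy sketch implements this MAP encoder and sampling decoder for $K = 2$; the specific values of $\pi_k$, $\mu_k$, and $\sigma_k$ (scalars, for simplicity) are our own assumptions:

```python
# Mixture-model "encoder" (MAP cluster assignment) and "decoder" (sampling
# from the assigned component), with known and fixed means/variances.
import numpy as np
from scipy.stats import norm

pi  = np.array([0.4, 0.6])        # pi_k = P[Z = k]
mu  = np.array([-2.0, 3.0])       # mu_k
sig = np.array([1.0, 0.5])        # sigma_k

def encode(x):
    # MAP rule: pick k maximizing p(Z=k | x), proportional to pi_k N(x | mu_k, sigma_k^2)
    posterior = pi * norm.pdf(x, loc=mu, scale=sig)
    return int(np.argmax(posterior)) + 1     # z in {1, 2}

def decode(z):
    # Draw a fresh sample from the z-th mixture component p(x | z)
    k = z - 1
    return np.random.normal(mu[k], sig[k])

x = 2.7
z = encode(x)        # likely z = 2, since x is near mu_2 = 3
x_new = decode(z)    # a sample from N(3, 0.5^2)
print(z, x_new)
```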

Smart readers like you will certainly complain: "Your example is so trivial that it is unrealistic." Don't worry; we understand. Of course, life is much harder than a Gaussian mixture model with known means and known variances. But one thing we learn from the example is that if we want to find the magical encoder and decoder, we must have a way to determine the two conditional distributions. However, they are both high-dimensional creatures. So, in order to say something more meaningful, we need to impose additional structure that allows us to generalize the concept to harder problems.

In the literature on VAEs, people came up with the idea of considering the following two proxy distributions:

  • $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$: The proxy for $p(\mathbf{z}|\mathbf{x})$. We will make it a Gaussian. Why Gaussian? No particular reason. Perhaps we are just ordinary (aka lazy) human beings.

  • $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$: The proxy for $p(\mathbf{x}|\mathbf{z})$. Believe it or not, we will also make it a Gaussian. But the role of this Gaussian is slightly different from that of the Gaussian $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$. While we will need to estimate the mean and variance of the Gaussian $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, we do not need to estimate anything for the Gaussian $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. Instead, we will need a decoder neural network that turns $\mathbf{z}$ into $\mathbf{x}$. The Gaussian $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ will be used to inform us how good our generated image $\mathbf{x}$ is.

The relationship between the input $\mathbf{x}$ and the latent $\mathbf{z}$, as well as the conditional distributions, is summarized in Figure 1. There are two nodes $\mathbf{x}$ and $\mathbf{z}$. The "forward" relationship is specified by $p(\mathbf{z}|\mathbf{x})$ (and approximated by $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$), whereas the "reverse" relationship is specified by $p(\mathbf{x}|\mathbf{z})$ (and approximated by $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$).

Figure 1: In a variational autoencoder, the variables $\mathbf{x}$ and $\mathbf{z}$ are connected by the conditional distributions $p(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z}|\mathbf{x})$. To make things work, we introduce two proxy distributions $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ and $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, respectively.
Example. It's time to consider another trivial example. Suppose that we have a random variable $\mathbf{x}$ and a latent variable $\mathbf{z}$ such that

$$\mathbf{x} \sim \mathcal{N}(\mathbf{x}\,|\,\mu, \sigma^2), \qquad \mathbf{z} \sim \mathcal{N}(\mathbf{z}\,|\,0, 1).$$

Our goal is to construct a VAE. (What?! This problem has a trivial solution where $\mathbf{z} = (\mathbf{x}-\mu)/\sigma$ and $\mathbf{x} = \mu + \sigma\mathbf{z}$. You are absolutely correct. But please follow our derivation to see whether the VAE framework makes sense.)

By constructing a VAE, we mean that we want to build two mappings, "encode" and "decode". For simplicity, let's assume that both mappings are affine transformations:

$$\mathbf{z} = \text{encode}(\mathbf{x}) = a\mathbf{x} + b, \qquad \text{so that } \boldsymbol{\phi} = [a, b],$$
$$\mathbf{x} = \text{decode}(\mathbf{z}) = c\mathbf{z} + d, \qquad \text{so that } \boldsymbol{\theta} = [c, d].$$

We are too lazy to find out the joint distribution $p(\mathbf{x},\mathbf{z})$ or the conditional distributions $p(\mathbf{x}|\mathbf{z})$ and $p(\mathbf{z}|\mathbf{x})$. But we can construct the proxy distributions $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ and $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. Since we have the freedom to choose what $q_{\boldsymbol{\phi}}$ and $p_{\boldsymbol{\theta}}$ should look like, how about we consider the following two Gaussians:

$$q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}\,|\,a\mathbf{x}+b,\, 1),$$
$$p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}\,|\,c\mathbf{z}+d,\, c).$$

The choice of these two Gaussians is not mysterious. For $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$: if we are given $\mathbf{x}$, of course we want the encoder to encode the distribution according to the structure we have chosen. Since the encoder structure is $a\mathbf{x}+b$, the natural choice for $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ is to have mean $a\mathbf{x}+b$. The variance is chosen as 1 because we know that the encoded sample $\mathbf{z}$ should be unit-variance. Similarly, for $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$: if we are given $\mathbf{z}$, the decoder must take the form of $c\mathbf{z}+d$ because this is how we set up the decoder. The variance is $c$, which is a parameter we need to figure out. We will pause for a moment before continuing this example, because we want to introduce a mathematical tool.

1.2 Evidence Lower Bound

How do we use these two proxy distributions to achieve our goal of determining the encoder and the decoder? If we treat $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ as optimization variables, then we need an objective function (i.e., a loss function) that we can optimize over $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$ using the training samples. To this end, we need to set up a loss function in terms of $\boldsymbol{\phi}$ and $\boldsymbol{\theta}$. The loss function we use here is the Evidence Lower BOund (ELBO) [1]:

$$\text{ELBO}(\mathbf{x}) \overset{\text{def}}{=} \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right]. \tag{2}$$

In a nutshell, the ELBO is a lower bound for the prior distribution $\log p(\mathbf{x})$, because we can show that

$$
\begin{aligned}
\log p(\mathbf{x}) \overset{\text{some magical steps}}{=} \;& \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right] + \mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x})\big) \\
\geq \;& \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right] \overset{\text{def}}{=} \text{ELBO}(\mathbf{x}),
\end{aligned}
\tag{3}
$$

where the inequality follows from the fact that the KL divergence is always non-negative. Therefore, the ELBO is a valid lower bound for $\log p(\mathbf{x})$. Since we never have access to $\log p(\mathbf{x})$, if we somehow have access to the ELBO and if the ELBO is a good lower bound, then we can effectively maximize the ELBO to achieve the goal of maximizing $\log p(\mathbf{x})$, which is the gold standard. Now, the question is how good the lower bound is. As the equation (and also Figure 2) shows, the inequality becomes an equality when our proxy $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ matches the true distribution $p(\mathbf{z}|\mathbf{x})$ exactly. So, part of the game is to ensure that $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ is close to $p(\mathbf{z}|\mathbf{x})$.

Figure 2: Visualization of $\log p(\mathbf{x})$ and the ELBO. The gap between the two is the KL divergence $\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x})\big)$.
Proof of Eqn (3). The whole trick here is to use our magical proxy $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ to poke around $p(\mathbf{x})$ and derive the bound:

$$
\begin{aligned}
\log p(\mathbf{x}) &= \log p(\mathbf{x}) \times \underbrace{\int q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})\, d\mathbf{z}}_{=1} && \text{multiply by 1} \\
&= \int \underbrace{\log p(\mathbf{x})}_{\text{constant wrt } \mathbf{z}} \times \underbrace{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}_{\text{distribution in } \mathbf{z}}\, d\mathbf{z} && \text{move } \log p(\mathbf{x}) \text{ into the integral} \\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x})],
\end{aligned}
\tag{4}
$$

where the last equality uses the fact that $\int a \times p_Z(z)\, dz = \mathbb{E}[a]$ for any random variable $Z$ and any scalar $a$; of course, $\mathbb{E}[a] = a$. See, we have already got $\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\cdot]$. Just a few more steps. Let's use Bayes' theorem, which states that $p(\mathbf{x},\mathbf{z}) = p(\mathbf{z}|\mathbf{x})\,p(\mathbf{x})$:

$$
\begin{aligned}
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p(\mathbf{x})] &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{z}|\mathbf{x})}\right] && \text{Bayes' theorem} \\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{p(\mathbf{z}|\mathbf{x})} \times \frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right] && \text{multiply and divide by } q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \\
&= \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right]}_{\text{ELBO}} + \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}{p(\mathbf{z}|\mathbf{x})}\right]}_{\mathbb{D}_{\text{KL}}(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z}|\mathbf{x}))},
\end{aligned}
\tag{5}
$$

where we recognize that the first term is exactly the ELBO, whereas the second term is exactly the KL divergence. Comparing Eqn (5) with Eqn (3), we know that life is good.
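
If you want to convince yourself of Eqn (3) numerically, here is a tiny sanity check on a discrete toy problem of our own making, where the joint $p(\mathbf{x},\mathbf{z})$ is small enough to enumerate:

```python
# Check the identity log p(x) = ELBO + D_KL(q || p(z|x)) on a 2x3 joint table.
import numpy as np

p_joint = np.array([[0.10, 0.25, 0.05],     # rows: x in {0,1}; cols: z in {0,1,2}
                    [0.30, 0.10, 0.20]])
x = 0                                        # the observed x
p_x = p_joint[x].sum()                       # marginal p(x)
p_z_given_x = p_joint[x] / p_x               # true posterior p(z|x)

q = np.array([0.5, 0.3, 0.2])                # any proxy q(z|x); need not match

elbo = np.sum(q * np.log(p_joint[x] / q))    # E_q[log p(x,z)/q(z|x)]
kl   = np.sum(q * np.log(q / p_z_given_x))   # D_KL(q || p(z|x)) >= 0

print(np.log(p_x), elbo + kl)                # equal, and elbo <= log p(x)
```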

Now we have the ELBO. But this ELBO is still not too useful, because it involves $p(\mathbf{x},\mathbf{z})$, something we have no access to. So, we need to do a bit more work. Let's take a closer look at the ELBO:

$$
\begin{aligned}
\text{ELBO}(\mathbf{x}) &\overset{\text{def}}{=} \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x},\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right] && \text{definition} \\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right] && p(\mathbf{x},\mathbf{z}) = p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z}) \\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\big[\log p(\mathbf{x}|\mathbf{z})\big] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{z})}{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\right] && \text{split expectation} \\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})\big] - \mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big), && \text{definition of KL}
\end{aligned}
\tag{6}
$$

where we have secretly replaced the inaccessible $p(\mathbf{x}|\mathbf{z})$ by its proxy $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. This is a beautiful result. We have just shown something very easy to interpret:

  • Reconstruction. The first term is about the decoder. We want the decoder to produce a good image $\mathbf{x}$ when we feed a latent $\mathbf{z}$ into it (of course!!). So, we want to maximize $\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. It is similar to maximum likelihood, where we want to find the model parameters that maximize the likelihood of observing the image. The expectation here is taken with respect to the samples $\mathbf{z}$ (conditioned on $\mathbf{x}$). This should not be a surprise, because the samples $\mathbf{z}$ are used to assess the quality of the decoder. They cannot be arbitrary noise vectors but must be meaningful latent vectors. So, $\mathbf{z}$ needs to be sampled from $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$.

  • Prior Matching. The second term is the KL divergence concerning the encoder. We want the encoder to turn $\mathbf{x}$ into a latent vector $\mathbf{z}$ such that the latent vector follows our choice of (lazy) distribution $\mathcal{N}(0, \mathbf{I})$. To be slightly more general, we write $p(\mathbf{z})$ as the target distribution. Because the KL divergence grows when the two distributions become more dissimilar, we need to put a negative sign in front of it so that the overall objective increases when the two distributions become more similar.

Example. Let's continue our trivial Gaussian example. We know from our previous derivation that

$$q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\mathbf{z}\,|\,a\mathbf{x}+b,\, 1), \qquad p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) = \mathcal{N}(\mathbf{x}\,|\,c\mathbf{z}+d,\, c).$$

To determine $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$, we need to minimize the prior matching error and maximize the reconstruction term. For the prior matching, we know that

$$\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big) = \mathbb{D}_{\text{KL}}\big(\mathcal{N}(\mathbf{z}\,|\,a\mathbf{x}+b,\, 1) \,\|\, \mathcal{N}(\mathbf{z}\,|\,0,\, 1)\big).$$

Since $\mathbb{E}[\mathbf{x}] = \mu$ and $\mathrm{Var}[\mathbf{x}] = \sigma^2$, the KL divergence is minimized when $a = \frac{1}{\sigma}$ and $b = -\frac{\mu}{\sigma}$, so that $a\mathbf{x}+b = \frac{\mathbf{x}-\mu}{\sigma}$. It then follows that $\mathbb{E}[a\mathbf{x}+b] = 0$ and $\mathrm{Var}[a\mathbf{x}+b] = 1$. For the reconstruction term, we know that

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}\left[-\frac{(c\mathbf{z}+d-\mu)^2}{2c^2}\right].$$

Since $\mathbb{E}[\mathbf{z}] = 0$ and $\mathrm{Var}[\mathbf{z}] = 1$, it follows that the term is maximized when $c = \sigma$ and $d = \mu$. To conclude, the encoder and decoder are

$$\mathbf{z} = \text{encode}(\mathbf{x}) = \frac{\mathbf{x}-\mu}{\sigma}, \qquad \mathbf{x} = \text{decode}(\mathbf{z}) = \sigma\mathbf{z} + \mu,$$

which is fairly easy to understand.
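
A quick numerical check of this closed-form solution, with our own arbitrary choice of $\mu$ and $\sigma$:

```python
# Verify a = 1/sigma, b = -mu/sigma, c = sigma, d = mu on simulated data.
import numpy as np

mu, sigma = 3.0, 2.0
x = np.random.normal(mu, sigma, size=100_000)   # x ~ N(mu, sigma^2)

z = (x - mu) / sigma           # encode(x) = a x + b
print(z.mean(), z.std())       # approx 0 and 1: z matches the prior N(0, 1)

x_rec = sigma * z + mu         # decode(z) = c z + d
print(np.allclose(x, x_rec))   # True: perfect reconstruction in this toy case
```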

The reconstruction term and the prior matching term are illustrated in Figure 3. In both cases, during training, we assume that we have access to both $\mathbf{z}$ and $\mathbf{x}$, where $\mathbf{z}$ needs to be sampled from $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$. Then, for the reconstruction, we estimate $\boldsymbol{\theta}$ to maximize $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$. For the prior matching, we find $\boldsymbol{\phi}$ to minimize the KL divergence. The optimization can be difficult, because updating $\boldsymbol{\phi}$ changes the distribution $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$.

Figure 3: Interpreting the reconstruction term and the prior matching term in the ELBO of a variational autoencoder.

1.3 Training VAE

Now that we understand the meaning of the ELBO, we can discuss how to train a VAE. To train a VAE, we need the ground truth pairs $(\mathbf{x}, \mathbf{z})$. We know how to get $\mathbf{x}$: it is just an image from a dataset. But what should the corresponding $\mathbf{z}$ be?

Let's talk about the encoder. We know that $\mathbf{z}$ is generated from the distribution $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$, and we know that $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})$ is a Gaussian. Let's assume that this Gaussian has a mean $\boldsymbol{\mu}$ and a covariance matrix $\sigma^2\mathbf{I}$ (Ha! Our laziness again: instead of a general covariance matrix, we assume equal variance along every dimension).

The tricky part is how to determine $\boldsymbol{\mu}$ and $\sigma^2$ from the input image $\mathbf{x}$. If you have run out of clues, don't worry. Welcome to the dark side of the Force: we construct deep neural networks such that

$$
\begin{aligned}
\boldsymbol{\mu} &= \underbrace{\boldsymbol{\mu}_{\boldsymbol{\phi}}}_{\text{neural network}}(\mathbf{x}), \\
\sigma^2 &= \underbrace{\sigma^2_{\boldsymbol{\phi}}}_{\text{neural network}}(\mathbf{x}).
\end{aligned}
$$

Consequently, we can sample $\mathbf{z}^{(\ell)}$ (where $\ell$ denotes the $\ell$-th training sample in the training set) from the Gaussian distribution

$$\mathbf{z}^{(\ell)} \sim \underbrace{\mathcal{N}\big(\mathbf{z}\,|\,\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)}),\, \sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})\mathbf{I}\big)}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(\ell)})}, \qquad \text{where } \boldsymbol{\mu}_{\boldsymbol{\phi}},\, \sigma^2_{\boldsymbol{\phi}} \text{ are functions of } \mathbf{x}. \tag{7}$$

This idea is summarized in Figure 4: we use a neural network to estimate the Gaussian parameters and then draw samples from that Gaussian. Note that $\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})$ and $\sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})$ are functions of $\mathbf{x}^{(\ell)}$. Therefore, for a different $\mathbf{x}^{(\ell)}$ we will have a different Gaussian.
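
As a concrete companion to Figure 4, here is a minimal PyTorch sketch of such an encoder; the layer sizes and the log-variance parameterization (a common numerical-stability trick) are our own assumptions, not prescribed by the text:

```python
# One shared network body with two heads: the mean and the (log-)variance.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, x_dim=784, z_dim=16, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, z_dim)     # mu_phi(x)
        self.logvar_head = nn.Linear(hidden, 1)     # log sigma_phi^2(x), isotropic

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.logvar_head(h)

enc = Encoder()
x = torch.rand(8, 784)          # a batch of 8 flattened images
mu, logvar = enc(x)             # a different x gives a different Gaussian
```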

Figure 4: Implementation of a VAE encoder. We use a neural network that takes an image $\mathbf{x}$ and estimates the mean $\boldsymbol{\mu}_{\boldsymbol{\phi}}$ and the variance $\sigma^2_{\boldsymbol{\phi}}$ of the Gaussian distribution.
Remark. For any high-dimensional Gaussian $\mathbf{x} \sim \mathcal{N}(\mathbf{x}\,|\,\boldsymbol{\mu}, \boldsymbol{\Sigma})$, the sampling process can be done via the transformation of white noise

$$\mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\Sigma}^{\frac{1}{2}}\mathbf{w}, \tag{8}$$

where $\mathbf{w} \sim \mathcal{N}(0, \mathbf{I})$. The half matrix $\boldsymbol{\Sigma}^{\frac{1}{2}}$ can be obtained through eigen-decomposition or Cholesky factorization. For diagonal matrices $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$, the above reduces to

$$\mathbf{x} = \boldsymbol{\mu} + \sigma\mathbf{w}, \qquad \text{where } \mathbf{w} \sim \mathcal{N}(0, \mathbf{I}). \tag{9}$$
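
A short numpy sketch of Eqn (8), using a Cholesky factor as the half matrix; the specific $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ are arbitrary:

```python
# Sample x ~ N(mu, Sigma) by transforming white noise w ~ N(0, I).
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

L = np.linalg.cholesky(Sigma)       # L @ L.T = Sigma, so L plays Sigma^(1/2)
w = np.random.randn(100_000, 2)     # rows of white noise w ~ N(0, I)
x = mu + w @ L.T                    # x = mu + Sigma^(1/2) w, Eqn (8)

print(x.mean(axis=0))               # approx mu
print(np.cov(x, rowvar=False))      # approx Sigma
```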

Let's talk about the decoder. The decoder is implemented via a neural network. For notational simplicity, let's denote it by $\text{decode}_{\boldsymbol{\theta}}$, where $\boldsymbol{\theta}$ denotes the network parameters. The job of the decoder network is to take a latent variable $\mathbf{z}$ and generate an image $\widehat{\mathbf{x}}$:

$$\widehat{\mathbf{x}} = \text{decode}_{\boldsymbol{\theta}}(\mathbf{z}). \tag{10}$$

Now, let's make one more (crazy) assumption: that the error between the decoded image $\widehat{\mathbf{x}}$ and the ground truth image $\mathbf{x}$ is Gaussian. (Wait, Gaussian again?!) We assume that

$$(\widehat{\mathbf{x}} - \mathbf{x}) \sim \mathcal{N}(0, \sigma^2_{\text{dec}}), \qquad \text{for some } \sigma^2_{\text{dec}}.$$

Then, it follows that the distribution $p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})$ is

$$
\begin{aligned}
\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z}) &= \log \mathcal{N}\big(\mathbf{x}\,|\,\text{decode}_{\boldsymbol{\theta}}(\mathbf{z}),\, \sigma^2_{\text{dec}}\mathbf{I}\big) \\
&= \log \frac{1}{\sqrt{(2\pi\sigma^2_{\text{dec}})^D}} \exp\left\{-\frac{\|\mathbf{x} - \text{decode}_{\boldsymbol{\theta}}(\mathbf{z})\|^2}{2\sigma^2_{\text{dec}}}\right\} \\
&= -\frac{\|\mathbf{x} - \text{decode}_{\boldsymbol{\theta}}(\mathbf{z})\|^2}{2\sigma^2_{\text{dec}}} \;-\; \underbrace{\log \sqrt{(2\pi\sigma^2_{\text{dec}})^D}}_{\text{you can ignore this term}},
\end{aligned}
\tag{11}
$$

where $D$ is the dimension of $\mathbf{x}$. This equation says that the maximization of the likelihood term in the ELBO is literally just the $\ell_2$ loss between the decoded image and the ground truth. This idea is illustrated in Figure 5.
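
The equivalence in Eqn (11) can be verified numerically; the sketch below uses torch.distributions with arbitrary stand-ins for $\mathbf{x}$ and $\text{decode}_{\boldsymbol{\theta}}(\mathbf{z})$:

```python
# Gaussian log-likelihood = -(l2 loss)/(2 sigma_dec^2) + a constant, Eqn (11).
import torch

D, sigma_dec = 784, 1.0
x     = torch.randn(D)      # ground truth image (flattened)
x_hat = torch.randn(D)      # stand-in for decode_theta(z)

log_lik = torch.distributions.Normal(x_hat, sigma_dec).log_prob(x).sum()
l2_term = -(x - x_hat).pow(2).sum() / (2 * sigma_dec**2)
const   = -0.5 * D * torch.log(torch.tensor(2 * torch.pi * sigma_dec**2))

print(torch.allclose(log_lik, l2_term + const))   # True
```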

Figure 5: Implementation of a VAE decoder. We use a neural network that takes a latent vector $\mathbf{z}$ and generates an image $\widehat{\mathbf{x}}$. Since we assume a Gaussian distribution, the log-likelihood gives us a quadratic equation.

1.4 Loss Function

Once we understand the structure of the encoder and the decoder, the loss function is easy to understand. We approximate the expectation by Monte Carlo simulation:

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x})}[\log p_{\boldsymbol{\theta}}(\mathbf{x}|\mathbf{z})] \approx \frac{1}{L}\sum_{\ell=1}^{L} \log p_{\boldsymbol{\theta}}\big(\mathbf{x}^{(\ell)}|\mathbf{z}^{(\ell)}\big), \qquad \mathbf{z}^{(\ell)} \sim q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(\ell)}),$$

where $\mathbf{x}^{(\ell)}$ is the $\ell$-th sample in the training set, and $\mathbf{z}^{(\ell)}$ is sampled from $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(\ell)}) = \mathcal{N}\big(\mathbf{z}\,|\,\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)}),\, \sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})\mathbf{I}\big)$.

Training loss of VAE:

$$\mathop{\text{argmax}}_{\boldsymbol{\phi},\boldsymbol{\theta}} \left\{\frac{1}{L}\sum_{\ell=1}^{L} \log p_{\boldsymbol{\theta}}\big(\mathbf{x}^{(\ell)}|\mathbf{z}^{(\ell)}\big) - \mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(\ell)}) \,\|\, p(\mathbf{z})\big)\right\}, \tag{12}$$

where $\{\mathbf{x}^{(\ell)}\}_{\ell=1}^{L}$ are the ground truth images in the training dataset, and $\mathbf{z}^{(\ell)}$ is sampled from Eqn (7).

Note that $\mathbf{z}$ in the KL divergence term does not depend on $\ell$. This is because we are measuring the KL divergence between two distributions; the variable $\mathbf{z}$ here is a dummy.

One last thing we need to clarify is the KL divergence. Since $q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(\ell)}) = \mathcal{N}\big(\mathbf{z}\,|\,\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)}),\, \sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})\mathbf{I}\big)$ and $p(\mathbf{z}) = \mathcal{N}(0, \mathbf{I})$, we are essentially computing the KL divergence between two Gaussian distributions. If you go to Wikipedia, you can find that the KL divergence between two $d$-dimensional Gaussian distributions $\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0)$ and $\mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ is

$$\mathbb{D}_{\text{KL}}\big(\mathcal{N}(\boldsymbol{\mu}_0, \boldsymbol{\Sigma}_0) \,\|\, \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)\big) = \frac{1}{2}\left(\text{Tr}(\boldsymbol{\Sigma}_1^{-1}\boldsymbol{\Sigma}_0) - d + (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)^T \boldsymbol{\Sigma}_1^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0) + \log\frac{\det\boldsymbol{\Sigma}_1}{\det\boldsymbol{\Sigma}_0}\right). \tag{13}$$

Substituting our distributions, i.e., considering $\boldsymbol{\mu}_0 = \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})$, $\boldsymbol{\Sigma}_0 = \sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})\mathbf{I}$, $\boldsymbol{\mu}_1 = 0$, and $\boldsymbol{\Sigma}_1 = \mathbf{I}$, we can show that the KL divergence has the analytic expression

$$\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{z}|\mathbf{x}^{(\ell)})\,\|\,p(\mathbf{z})\big) = \frac{1}{2}\left(d\,\sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)}) - d + \boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})^T\boldsymbol{\mu}_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)}) - d\log\sigma^2_{\boldsymbol{\phi}}(\mathbf{x}^{(\ell)})\right), \tag{14}$$

where $d$ is the dimension of the vector $\mathbf{z}$. Therefore, the overall loss function in Eqn (12) is differentiable, and we can train the encoder and the decoder end-to-end by backpropagating the gradients.
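To make Eqn (14) concrete, here is a minimal NumPy sketch (our own illustration, not code from the tutorial) that evaluates the closed-form KL term and checks it against a Monte Carlo estimate; the names `mu`, `sigma2`, and `kl_gaussian_to_standard` are ours.

```python
import numpy as np

def kl_gaussian_to_standard(mu, sigma2):
    """KL( N(mu, sigma2*I) || N(0, I) ) for an isotropic Gaussian, Eqn (14)."""
    d = mu.shape[0]
    return 0.5 * (d * sigma2 - d + mu @ mu - d * np.log(sigma2))

rng = np.random.default_rng(0)
mu, sigma2, d = np.array([0.5, -1.0, 2.0]), 0.7, 3

# Monte Carlo check: KL = E_q[ log q(z) - log p(z) ] with z ~ q.
z = mu + np.sqrt(sigma2) * rng.standard_normal((100_000, d))
log_q = -0.5 * (d * np.log(2 * np.pi * sigma2) + ((z - mu) ** 2).sum(1) / sigma2)
log_p = -0.5 * (d * np.log(2 * np.pi) + (z ** 2).sum(1))
print(kl_gaussian_to_standard(mu, sigma2))   # analytic value
print((log_q - log_p).mean())                # Monte Carlo, should be close
```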

    1.5 Inference with VAE

For inference, we can simply throw a latent vector $\mathbf{z}$ (which is sampled from $p(\mathbf{z}) = \mathcal{N}(0,\mathbf{I})$) into the decoder $\text{decode}_{\boldsymbol{\theta}}$ and get an image $\mathbf{x}$. That's it; see Figure 6.

Figure 6: Generating an image with a VAE is as simple as sending a latent noise code $\mathbf{z}$ through the decoder.
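As a sketch of this inference step (our own illustration under assumed shapes; `decoder` stands in for a trained decoder network and is not the tutorial's model), the whole procedure is two lines:

```python
import torch

d = 128                                   # assumed latent dimension
decoder = torch.nn.Sequential(            # stand-in for a trained decoder
    torch.nn.Linear(d, 784), torch.nn.Sigmoid())

z = torch.randn(1, d)                     # z ~ N(0, I), the latent code
with torch.no_grad():
    x = decoder(z).reshape(28, 28)        # decoded image, e.g. a 28x28 patch
```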

Congratulations! We are done. This is everything about the VAE.

If you want to read more, we highly recommend the tutorial by Kingma and Welling [1]. A shorter tutorial can be found in [2]. If you type "VAE tutorial PyTorch" in Google, you will find hundreds if not thousands of programming tutorials and videos.

    2 Denoising Diffusion Probabilistic Model (DDPM)

In this section we discuss the DDPM by Ho et al. [4]. If you are confused by the thousands of tutorials online, rest assured that DDPM is not that complicated. You just need to understand the following summary:

    Diffusion models are incremental updates where the assembly of the whole gives us the encoder-decoder structure. The transition from one state to another is realized by a denoiser.

Why incremental? It is like turning the direction of a giant ship. You need to turn the ship slowly towards your desired direction, or you will lose control. The same principle applies to your life, your company's human resources, your university's administration, your spouse, your children, and everything in life: "Bend it one inch at a time!" (Credit: Sergio Goma, who made this remark at Electronic Imaging 2023.)

The structure of a diffusion model is shown below. It is known as the variational diffusion model [5]. The variational diffusion model has a sequence of states $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_T$:

• $\mathbf{x}_0$: the original image, which is the same as $\mathbf{x}$ in the VAE.

• $\mathbf{x}_T$: the latent variable, which is the same as $\mathbf{z}$ in the VAE. Because we are all lazy, we want $\mathbf{x}_T \sim \mathcal{N}(0,\mathbf{I})$.

• $\mathbf{x}_1,\ldots,\mathbf{x}_{T-1}$: the intermediate states. They are also latent variables, but they are not white Gaussian.

The structure of the variational diffusion model is shown in Figure 7. The forward and reverse paths are analogous to those of a single-step variational autoencoder. The difference is that the encoder and the decoder have identical input-output dimensions. The assembly of all the forward building blocks gives us the encoder, and the assembly of all the reverse building blocks gives us the decoder.

Figure 7: Variational diffusion model. In this model, the input image is $\mathbf{x}_0$ and the white noise is $\mathbf{x}_T$. The intermediate variables (or states) $\mathbf{x}_1,\ldots,\mathbf{x}_{T-1}$ are latent variables. The transition from $\mathbf{x}_{t-1}$ to $\mathbf{x}_t$ is analogous to the forward step (encoder) in a VAE, whereas the transition from $\mathbf{x}_t$ to $\mathbf{x}_{t-1}$ is analogous to the reverse step (decoder) in a VAE. Note, however, that here the input and output dimensions of the encoder/decoder are identical.

    2.1 Building Blocks

Transition Block. The $t$-th transition block consists of three states $\mathbf{x}_{t-1}$, $\mathbf{x}_t$, and $\mathbf{x}_{t+1}$. As illustrated in Figure 8, there are two possible paths to reach the state $\mathbf{x}_t$:

• The forward transition that goes from $\mathbf{x}_{t-1}$ to $\mathbf{x}_t$. The associated transition distribution is $p(\mathbf{x}_t|\mathbf{x}_{t-1})$. In plain words, if you tell us $\mathbf{x}_{t-1}$, we can tell you $\mathbf{x}_t$ according to $p(\mathbf{x}_t|\mathbf{x}_{t-1})$. However, just like in the VAE, the transition distribution $p(\mathbf{x}_t|\mathbf{x}_{t-1})$ is never accessible. But this is fine: lazy people like us will simply approximate it by a Gaussian $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$. We will discuss the exact form of $q_{\boldsymbol{\phi}}$ later; for now, it is just some Gaussian.

• The reverse transition that goes from $\mathbf{x}_{t+1}$ to $\mathbf{x}_t$. Again, we never know $p(\mathbf{x}_t|\mathbf{x}_{t+1})$, but that is fine. We simply use another Gaussian $p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})$ to approximate the true distribution, where the mean needs to be estimated by a neural network.

Figure 8: The transition block of a variational diffusion model consists of three nodes. The transition distributions $p(\mathbf{x}_t|\mathbf{x}_{t+1})$ and $p(\mathbf{x}_t|\mathbf{x}_{t-1})$ are not accessible, but we can approximate them by Gaussians.
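As a rough sketch of these two approximations (our own illustration, not the paper's architecture; the network `mean_net`, the dimension, and the schedule value are assumptions), the forward transition is a fixed Gaussian of the form introduced in Eqn (15) below, while the reverse transition is a Gaussian whose mean is predicted by a network:

```python
import torch

d = 784                                    # assumed state dimension
alpha_t = 0.97                             # one entry of an assumed schedule
mean_net = torch.nn.Linear(d, d)           # stand-in for a denoising network

def forward_transition(x_prev):
    # q_phi(x_t | x_{t-1}) = N( sqrt(alpha_t) x_{t-1}, (1 - alpha_t) I )
    return alpha_t**0.5 * x_prev + (1 - alpha_t)**0.5 * torch.randn_like(x_prev)

def reverse_transition(x_next, sigma=0.1):
    # p_theta(x_t | x_{t+1}): Gaussian with a network-predicted mean
    return mean_net(x_next) + sigma * torch.randn_like(x_next)
```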

Initial Block. The initial block of the variational diffusion model focuses on the state $\mathbf{x}_0$. Since every problem we study starts at $\mathbf{x}_0$, there is only the reverse transition from $\mathbf{x}_1$ to $\mathbf{x}_0$ and no such thing as going from $\mathbf{x}_{-1}$ to $\mathbf{x}_0$. Therefore, we only need to worry about $p(\mathbf{x}_0|\mathbf{x}_1)$. But since $p(\mathbf{x}_0|\mathbf{x}_1)$ is never accessible, we approximate it by a Gaussian $p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$ whose mean is computed through a neural network. See Figure 9 for an illustration.

Figure 9: The initial block of the variational diffusion model focuses on the node $\mathbf{x}_0$. Since there is no state before time $t=0$, we only have the reverse transition from $\mathbf{x}_1$ to $\mathbf{x}_0$.

Final Block. The final block focuses on the state $\mathbf{x}_T$. Remember that $\mathbf{x}_T$ is our final latent variable, which is a white Gaussian noise vector. Because it is the final block, there is only the forward transition from $\mathbf{x}_{T-1}$ to $\mathbf{x}_T$ and no such thing as going from $\mathbf{x}_{T+1}$ to $\mathbf{x}_T$. The forward transition is approximated by $q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})$, which is a Gaussian. See Figure 10 for an illustration.

Figure 10: The final block of the variational diffusion model focuses on the node $\mathbf{x}_T$. Since there is no state after time $t=T$, we only have the forward transition from $\mathbf{x}_{T-1}$ to $\mathbf{x}_T$.

Understanding the Transition Distribution. Before we proceed further, we need a small detour to talk about the transition distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$. We know that it is a Gaussian, but we still need to know its formal definition and where this definition comes from.

Transition Distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$. In a denoising diffusion probabilistic model, the transition distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$ is defined as
$$q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1}) \overset{\text{def}}{=} \mathcal{N}\big(\mathbf{x}_t \,\big|\, \sqrt{\alpha_t}\,\mathbf{x}_{t-1},\; (1-\alpha_t)\mathbf{I}\big). \tag{15}$$

In other words, the mean is $\sqrt{\alpha_t}\,\mathbf{x}_{t-1}$ and the variance is $1-\alpha_t$. The scaling factor $\sqrt{\alpha_t}$ is chosen so that the magnitude of the variance is preserved, i.e., it will neither explode nor vanish after many iterations.
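As a minimal sketch of Eqn (15) (our own code, with an assumed constant schedule $\alpha_t = 0.97$ and a toy bimodal "data" distribution), one forward step is just a scaled copy of the previous state plus scaled white noise; iterating it drives the samples toward $\mathcal{N}(0,\mathbf{I})$:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.97                          # assumed constant alpha_t for all t

def forward_step(x_prev):
    """Draw x_t ~ q_phi(x_t | x_{t-1}) = N(sqrt(alpha) x_{t-1}, (1-alpha) I)."""
    eps = rng.standard_normal(x_prev.shape)
    return np.sqrt(alpha) * x_prev + np.sqrt(1 - alpha) * eps

x = rng.choice([-2.0, 2.0], size=1000)   # toy bimodal data, variance 4
for t in range(300):                     # run the chain for many steps
    x = forward_step(x)
print(x.mean(), x.var())                 # drifts toward mean 0, variance 1
```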

Example. Let's consider a Gaussian mixture model
$$\mathbf{x}_0 \sim p_0(\mathbf{x}) = \pi_1 \mathcal{N}(\mathbf{x}|\mu_1,\sigma_1^2) + \pi_2 \mathcal{N}(\mathbf{x}|\mu_2,\sigma_2^2).$$
Given the transition probability, we know that
$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}, \qquad \text{where } \boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I}).$$
For a mixture model, it is not difficult to show that the probability distribution of $\mathbf{x}_t$ can be calculated recursively via the following algorithm, for $t = 1,2,\ldots,T$:
$$p_t(\mathbf{x}) = \pi_1 \mathcal{N}\big(\mathbf{x}\,\big|\,\sqrt{\alpha_t}\,\mu_{1,t-1},\; \alpha_t\sigma_{1,t-1}^2 + (1-\alpha_t)\big) + \pi_2 \mathcal{N}\big(\mathbf{x}\,\big|\,\sqrt{\alpha_t}\,\mu_{2,t-1},\; \alpha_t\sigma_{2,t-1}^2 + (1-\alpha_t)\big), \tag{16}$$
where $\mu_{1,t-1}$ is the mean at $t-1$, with $\mu_{1,0} = \mu_1$ being the initial mean. Similarly, $\sigma_{1,t-1}^2$ is the variance at $t-1$, with $\sigma_{1,0}^2 = \sigma_1^2$ being the initial variance. In the figure below, we show the example where $\pi_1 = 0.3$, $\pi_2 = 0.7$, $\mu_1 = -2$, $\mu_2 = 2$, $\sigma_1 = 0.2$, and $\sigma_2 = 1$. The rate is defined as $\alpha_t = 0.97$ for all $t$. We plot the probability distribution function for different $t$.

[Figure: snapshots of the mixture density $p_t(\mathbf{x})$ at several values of $t$.]
Remark. For those who would like to understand how we derive the probability density of a mixture model in Eqn (16), we can show a simple derivation. Consider a mixture model
$$p(\mathbf{x}) = \sum_{k=1}^K \pi_k \underset{p(\mathbf{x}|k)}{\underbrace{\mathcal{N}(\mathbf{x}|\mu_k,\sigma_k^2\mathbf{I})}}.$$
If we consider a new variable $\mathbf{y} = \sqrt{\alpha}\,\mathbf{x} + \sqrt{1-\alpha}\,\boldsymbol{\epsilon}$ where $\boldsymbol{\epsilon} \sim \mathcal{N}(0,\mathbf{I})$, then the distribution of $\mathbf{y}$ can be derived by using the law of total probability:
$$p(\mathbf{y}) = \sum_{k=1}^K p(\mathbf{y}|k)\,p(k) = \sum_{k=1}^K \pi_k\, p(\mathbf{y}|k).$$
Since $\mathbf{y}|k$ is a linear combination of a Gaussian random variable $\mathbf{x}$ and another Gaussian random variable $\boldsymbol{\epsilon}$, the sum $\mathbf{y}$ will remain a Gaussian. Its mean and variance are
$$\mathbb{E}[\mathbf{y}|k] = \sqrt{\alpha}\,\mathbb{E}[\mathbf{x}|k] + \sqrt{1-\alpha}\,\mathbb{E}[\boldsymbol{\epsilon}] = \sqrt{\alpha}\,\mu_k,$$
$$\mathrm{Var}[\mathbf{y}|k] = \alpha\,\mathrm{Var}[\mathbf{x}|k] + (1-\alpha)\,\mathrm{Var}[\boldsymbol{\epsilon}] = \alpha\sigma_k^2 + (1-\alpha).$$
So, $p(\mathbf{y}|k) = \mathcal{N}\big(\mathbf{y}\,\big|\,\sqrt{\alpha}\,\mu_k,\; \alpha\sigma_k^2 + (1-\alpha)\big)$. This completes the derivation.
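To see the recursion of Eqn (16) in action, here is a small NumPy sketch (ours, using the example's parameters) that propagates the component means and variances and compares the resulting mixture moments against samples simulated through the forward chain:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, T = 0.97, 100
pi = np.array([0.3, 0.7])
mu = np.array([-2.0, 2.0])            # component means at t = 0
var = np.array([0.2, 1.0]) ** 2       # component variances at t = 0

# Propagate the recursion of Eqn (16): each component stays Gaussian.
for t in range(T):
    mu = np.sqrt(alpha) * mu
    var = alpha * var + (1 - alpha)

# Simulate samples through the forward chain and compare moments.
k = rng.choice(2, size=100_000, p=pi)
x = np.array([-2.0, 2.0])[k] + np.array([0.2, 1.0])[k] * rng.standard_normal(k.size)
for t in range(T):
    x = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * rng.standard_normal(x.size)

mean_pred = (pi * mu).sum()
var_pred = (pi * (var + mu**2)).sum() - mean_pred**2
print(mean_pred, x.mean())            # analytic vs simulated mean
print(var_pred, x.var())              # analytic vs simulated variance
```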

2.2 The magical scalars $\sqrt{\alpha_t}$ and $1-\alpha_t$

For the above transition probability, you may wonder how the genies (the authors of denoising diffusion) came up with the magical scalars $\sqrt{\alpha_t}$ and $(1-\alpha_t)$. To demystify this, we start with two unrelated scalars $a \in \mathbb{R}$ and $b \in \mathbb{R}$, and we define the transition distribution as

$$q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t \,\big|\, a\,\mathbf{x}_{t-1},\; b^2\mathbf{I}\big). \tag{17}$$

Here is the rule of thumb: Why $\sqrt{\alpha_t}$ and $1-\alpha_t$? We want to choose $a$ and $b$ such that the distribution of $\mathbf{x}_t$ will become $\mathcal{N}(0,\mathbf{I})$ when $t$ is large enough. It turns out that the answer is $a = \sqrt{\alpha}$ and $b = \sqrt{1-\alpha}$.

Proof. We want to show that $a = \sqrt{\alpha}$ and $b = \sqrt{1-\alpha}$. For the distribution shown in Eqn (17), the equivalent sampling step is:
$$\mathbf{x}_t = a\,\mathbf{x}_{t-1} + b\,\boldsymbol{\epsilon}_{t-1}, \qquad \text{where } \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(0,\mathbf{I}). \tag{18}$$
Think about this: if there is a random variable $X \sim \mathcal{N}(\mu,\sigma^2)$, drawing $X$ from this Gaussian can be equivalently achieved by defining $X = \mu + \sigma\eta$ where $\eta \sim \mathcal{N}(0,1)$.

We can carry on the recursion to show that
$$\begin{aligned} \mathbf{x}_t &= a\,\mathbf{x}_{t-1} + b\,\boldsymbol{\epsilon}_{t-1} \\ &= a(a\,\mathbf{x}_{t-2} + b\,\boldsymbol{\epsilon}_{t-2}) + b\,\boldsymbol{\epsilon}_{t-1} && (\text{substitute } \mathbf{x}_{t-1} = a\,\mathbf{x}_{t-2} + b\,\boldsymbol{\epsilon}_{t-2}) \\ &= a^2\mathbf{x}_{t-2} + ab\,\boldsymbol{\epsilon}_{t-2} + b\,\boldsymbol{\epsilon}_{t-1} && (\text{regroup terms}) \\ &\;\;\vdots \\ &= a^t\mathbf{x}_0 + b\underset{\overset{\text{def}}{=}\,\mathbf{w}_t}{\underbrace{\left[\boldsymbol{\epsilon}_{t-1} + a\,\boldsymbol{\epsilon}_{t-2} + a^2\boldsymbol{\epsilon}_{t-3} + \ldots + a^{t-1}\boldsymbol{\epsilon}_0\right]}}. \end{aligned} \tag{19}$$
The finite sum above is a sum of independent Gaussian random variables. The mean vector $\mathbb{E}[b\,\mathbf{w}_t]$ remains zero because every term has a zero mean. The covariance matrix (for a zero-mean vector) is
$$\begin{aligned} \text{Cov}[b\,\mathbf{w}_t] \overset{\text{def}}{=} \mathbb{E}[b^2\,\mathbf{w}_t\mathbf{w}_t^T] &= b^2\big(\text{Cov}(\boldsymbol{\epsilon}_{t-1}) + a^2\,\text{Cov}(\boldsymbol{\epsilon}_{t-2}) + \ldots + (a^{t-1})^2\,\text{Cov}(\boldsymbol{\epsilon}_0)\big) \\ &= b^2\big(1 + a^2 + a^4 + \ldots + a^{2(t-1)}\big)\mathbf{I} = b^2 \cdot \frac{1-a^{2t}}{1-a^2}\,\mathbf{I}. \end{aligned}$$
As $t \rightarrow \infty$, $a^{2t} \rightarrow 0$ for any $0 < a < 1$. Therefore, at the limit when $t = \infty$,
$$\lim_{t\rightarrow\infty} \text{Cov}[b\,\mathbf{w}_t] = \frac{b^2}{1-a^2}\,\mathbf{I}.$$
So, if we want $\lim_{t\rightarrow\infty}\text{Cov}[b\,\mathbf{w}_t] = \mathbf{I}$ (so that the distribution of $\mathbf{x}_t$ will approach $\mathcal{N}(0,\mathbf{I})$), then $b = \sqrt{1-a^2}$. Now, if we let $a = \sqrt{\alpha}$, then $b = \sqrt{1-\alpha}$.

This will give us
$$\mathbf{x}_t = \sqrt{\alpha}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha}\,\boldsymbol{\epsilon}_{t-1}. \tag{20}$$
Or equivalently, $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t\,\big|\,\sqrt{\alpha}\,\mathbf{x}_{t-1},\; (1-\alpha)\mathbf{I}\big)$. You can replace $\alpha$ by $\alpha_t$, if you prefer a scheduler.
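A quick numerical check of this proof (our own sketch, not from the text): with $a=\sqrt{\alpha}$ and $b=\sqrt{1-\alpha}$ the sample variance of the chain settles at 1, whereas an uncalibrated choice of $b$ settles at $b^2/(1-a^2)$ instead.

```python
import numpy as np

rng = np.random.default_rng(2)

def run_chain(a, b, T=2000, n=50_000):
    """Iterate x_t = a x_{t-1} + b eps and return the final sample variance."""
    x = np.zeros(n)                      # start from x_0 = 0 for simplicity
    for _ in range(T):
        x = a * x + b * rng.standard_normal(n)
    return x.var()

alpha = 0.97
print(run_chain(np.sqrt(alpha), np.sqrt(1 - alpha)))  # ~1.0, variance preserved
print(run_chain(np.sqrt(alpha), 0.5))                 # ~0.25/(1-0.97) = 8.3
```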

2.3 Distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$

With the understanding of the magical scalars, we can now talk about the distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$. That is, given $\mathbf{x}_0$, we want to know how $\mathbf{x}_t$ will be distributed.

Conditional distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$. The conditional distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$ is given by
$$q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t \,\big|\, \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,\; (1-\overline{\alpha}_t)\mathbf{I}\big), \tag{21}$$
where $\overline{\alpha}_t = \prod_{i=1}^t \alpha_i$.
Proof. To see why this is the case, we can re-do the recursion, but this time we use $\sqrt{\alpha_t}\,\mathbf{x}_{t-1}$ and $(1-\alpha_t)\mathbf{I}$ as the mean and covariance, respectively. This will give us
$$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t}\big(\sqrt{\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-2}\big) + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1} \\ &= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2} + \underset{\mathbf{w}_1}{\underbrace{\sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-2} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}}}. \end{aligned} \tag{22}$$
Therefore, we have a sum of two Gaussians. But since the sum of two Gaussians remains a Gaussian, we can just calculate its new covariance (the mean remains zero). The new covariance is
$$\begin{aligned} \mathbb{E}[\mathbf{w}_1\mathbf{w}_1^T] &= \big[(\sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}})^2 + (\sqrt{1-\alpha_t})^2\big]\mathbf{I} \\ &= \big[\alpha_t(1-\alpha_{t-1}) + 1 - \alpha_t\big]\mathbf{I} = [1-\alpha_t\alpha_{t-1}]\mathbf{I}. \end{aligned}$$
Returning to Eqn (22), we can show that the recursion becomes a linear combination of $\mathbf{x}_{t-2}$ and a noise vector $\boldsymbol{\epsilon}_{t-2}$:
$$\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t\alpha_{t-1}}\,\mathbf{x}_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\,\boldsymbol{\epsilon}_{t-2} \\ &= \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\mathbf{x}_{t-3} + \sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\,\boldsymbol{\epsilon}_{t-3} \\ &\;\;\vdots \\ &= \sqrt{\prod_{i=1}^t \alpha_i}\,\mathbf{x}_0 + \sqrt{1-\prod_{i=1}^t \alpha_i}\,\boldsymbol{\epsilon}_0. \end{aligned} \tag{23}$$
So, if we define $\overline{\alpha}_t = \prod_{i=1}^t \alpha_i$, we can show that
$$\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}_0. \tag{24}$$
In other words, the distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$ is
$$\mathbf{x}_t \sim q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t \,\big|\, \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,\; (1-\overline{\alpha}_t)\mathbf{I}\big). \tag{25}$$

The utility of the new distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$ is its one-shot forward diffusion step, compared to running the chain $\mathbf{x}_0 \rightarrow \mathbf{x}_1 \rightarrow \ldots \rightarrow \mathbf{x}_{T-1} \rightarrow \mathbf{x}_T$. At every step of the forward diffusion model, since we already know $\mathbf{x}_0$ and we assume that all subsequent transitions are Gaussian, we immediately know $\mathbf{x}_t$ for any $t$. The situation can be understood from Figure 11.

Figure 11: The difference between $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$ and $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)$.
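The one-shot property of Eqn (24) is easy to verify numerically. The sketch below (our own, with an assumed constant schedule and a toy data distribution) draws $\mathbf{x}_t$ both by running the chain step by step and in one shot via $\overline{\alpha}_t$; the two empirical distributions agree:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, t, n = 0.97, 50, 100_000
x0 = rng.choice([-2.0, 2.0], size=n)          # some toy data distribution

# Step-by-step chain: t applications of q_phi(x_t | x_{t-1}).
x_chain = x0.copy()
for _ in range(t):
    x_chain = np.sqrt(alpha) * x_chain + np.sqrt(1 - alpha) * rng.standard_normal(n)

# One-shot sampling from q_phi(x_t | x_0) using alpha_bar_t, Eqn (24).
alpha_bar = alpha ** t
x_shot = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * rng.standard_normal(n)

print(x_chain.mean(), x_shot.mean())          # matching means
print(x_chain.var(), x_shot.var())            # matching variances
```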
Example. For a Gaussian mixture model such that $\mathbf{x} \sim p_0(\mathbf{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_k,\sigma_k^2\mathbf{I})$, we can show that the distribution at time $t$ is
$$\begin{aligned} p_t(\mathbf{x}) &= \sum_{k=1}^K \pi_k\, \mathcal{N}\big(\mathbf{x} \,\big|\, \sqrt{\overline{\alpha}_t}\,\boldsymbol{\mu}_k,\; (1-\overline{\alpha}_t)\mathbf{I} + \overline{\alpha}_t\sigma_k^2\mathbf{I}\big) \\ &= \sum_{k=1}^K \pi_k\, \mathcal{N}\big(\mathbf{x} \,\big|\, \sqrt{\alpha^t}\,\boldsymbol{\mu}_k,\; (1-\alpha^t)\mathbf{I} + \alpha^t\sigma_k^2\mathbf{I}\big), \qquad \text{if } \alpha_t = \alpha \text{ so that } \overline{\alpha}_t = \prod_{i=1}^t \alpha = \alpha^t. \end{aligned} \tag{26}$$
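As a sanity check (our own sketch), the closed form in Eqn (26) should agree with the per-step recursion of Eqn (16); propagating the moments of one component both ways gives identical numbers:

```python
import numpy as np

alpha, t = 0.97, 50
mu0, var0 = -2.0, 0.2**2                 # one mixture component from the example

# Recursion of Eqn (16): update the component mean and variance t times.
mu, var = mu0, var0
for _ in range(t):
    mu = np.sqrt(alpha) * mu
    var = alpha * var + (1 - alpha)

# Closed form of Eqn (26) with alpha_bar_t = alpha**t.
alpha_bar = alpha ** t
print(mu, np.sqrt(alpha_bar) * mu0)              # identical means
print(var, (1 - alpha_bar) + alpha_bar * var0)   # identical variances
```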

If you wonder how the probability distribution $p_t$ evolves over time $t$, Figure 12 shows the trajectory of the distribution. You can see that when $t=0$, the initial distribution is a mixture of two Gaussians. As we proceed by following the transition defined in Eqn (26), the distribution gradually becomes the single Gaussian $\mathcal{N}(0,1)$.

Figure 12: Trajectory plot of the Gaussian mixture as we progressively transition the probability distribution to $\mathcal{N}(0,1)$.

On the same plot, we overlay a few instantaneous trajectories of the random samples $\mathbf{x}_t$ as functions of time $t$. The equation we used to generate the samples is

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}, \qquad \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(0,\mathbf{I}).$$

As you can see, the trajectories of $\mathbf{x}_t$ more or less follow the distribution $p_t(\mathbf{x})$.

    2.4 Evidence Lower Bound

Now that we understand the structure of the variational diffusion model, we can write down the ELBO and train the model. The ELBO for the variational diffusion model is
$$\text{ELBO}_{\boldsymbol{\phi},\boldsymbol{\theta}}(\mathbf{x}) = \underset{\text{reconstruction}}{\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\big[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\big]}} - \underset{\text{prior matching}}{\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1}|\mathbf{x}_0)}\big[\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\,\|\,p(\mathbf{x}_T)\big)\big]}} - \sum_{t=1}^{T-1} \underset{\text{consistency}}{\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)}\big[\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})\,\|\,p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})\big)\big]}}. \tag{27}$$
It consists of three components:

• Reconstruction. The reconstruction term is based on the initial block. We use the log-likelihood $p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$ to measure how well the neural network associated with $p_{\boldsymbol{\theta}}$ can recover the image $\mathbf{x}_0$ from the latent variable $\mathbf{x}_1$. The expectation is taken with respect to samples drawn from $q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)$, which is the distribution that generates $\mathbf{x}_1$. If you are puzzled why we want to draw samples from $q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)$, just think about where the samples $\mathbf{x}_1$ should come from. The samples $\mathbf{x}_1$ do not come from the sky. Since they are intermediate latent variables, they are created by the forward transition $q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)$. Therefore, we should generate samples from $q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)$.

• Prior Matching. The prior matching term is based on the final block. We use the KL divergence to measure the difference between $q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})$ and $p(\mathbf{x}_T)$. The first distribution, $q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})$, is the forward transition from $\mathbf{x}_{T-1}$ to $\mathbf{x}_T$; this is how $\mathbf{x}_T$ is generated. The second distribution is $p(\mathbf{x}_T)$. Because of our laziness, $p(\mathbf{x}_T)$ is $\mathcal{N}(0,\mathbf{I})$. We want $q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})$ to be as close to $\mathcal{N}(0,\mathbf{I})$ as possible. The samples here are $\mathbf{x}_{T-1}$, which are drawn from $q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1}|\mathbf{x}_0)$, because $q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1}|\mathbf{x}_0)$ provides the forward sample generation process.

  • Consistency. The consistency term is based on the transition blocks. There are two directions. The forward transition is determined by the distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$, whereas the reverse transition is determined by the neural network $p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})$. The consistency term uses the KL divergence to measure the deviation between the two. The expectation is taken over the pairs of samples $(\mathbf{x}_{t-1},\mathbf{x}_{t+1})$ drawn from the joint distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)$. Ah, but what is $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)$? Don't worry. We will deal with it shortly.

  • For now, we will skip the training and the inference because this formulation is not yet ready to be implemented. We will discuss one more trick, and then we will talk about the implementation. (A toy numerical sketch of the three terms above follows this list.)
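To make the three terms concrete, here is a minimal numerical sketch of the ELBO in Eqn (27) for a scalar toy diffusion. It is illustrative only: it assumes a DDPM-style Gaussian forward transition $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})=\mathcal{N}(\sqrt{\alpha}\,\mathbf{x}_{t-1},(1-\alpha)\mathbf{I})$ (introduced formally later in this tutorial), and `mu_theta` is a made-up stand-in for the trained reverse-mean network, not the tutorial's model.

```python
# A minimal sketch (illustrative only) of the three ELBO terms in Eqn (27)
# for a scalar toy diffusion. Assumptions: DDPM-style forward transition
# q(x_t | x_{t-1}) = N(sqrt(alpha) x_{t-1}, 1 - alpha), and a hypothetical
# reverse model p_theta(x_{t-1} | x_t) = N(mu_theta(x_t), 1 - alpha).
import numpy as np

rng = np.random.default_rng(0)
alpha, T, n_mc = 0.9, 5, 10_000
sigma = np.sqrt(1 - alpha)

def kl_gauss(m1, s1, m2, s2):
    """KL( N(m1, s1^2) || N(m2, s2^2) ) for scalar Gaussians, elementwise."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def mu_theta(x):                    # made-up stand-in for the learned mean
    return x / np.sqrt(alpha)

x0 = 1.0
# Simulate the forward chain x_0 -> x_1 -> ... -> x_T, n_mc times.
x = np.full(n_mc, x0)
chain = [x]
for t in range(T):
    x = np.sqrt(alpha) * x + sigma * rng.standard_normal(n_mc)
    chain.append(x)

# Reconstruction: E_{q(x1|x0)} [ log p_theta(x0 | x1) ]
x1 = chain[1]
recon = np.mean(-0.5 * np.log(2 * np.pi * sigma**2)
                - (x0 - mu_theta(x1)) ** 2 / (2 * sigma**2))
# Prior matching: -E[ KL( q(x_T | x_{T-1}) || N(0, 1) ) ]
prior = -np.mean(kl_gauss(np.sqrt(alpha) * chain[T - 1], sigma, 0.0, 1.0))
# Consistency: -sum_{t=1}^{T-1} E[ KL( q(x_t | x_{t-1}) || p_theta(x_t | x_{t+1}) ) ]
consist = -sum(np.mean(kl_gauss(np.sqrt(alpha) * chain[t - 1], sigma,
                                mu_theta(chain[t + 1]), sigma))
               for t in range(1, T))
print("ELBO estimate:", recon + prior + consist)
```

The printed value is only a Monte Carlo estimate of the lower bound for this made-up model; the point is that each term reduces to sampling the forward chain and evaluating Gaussian log-densities and KL divergences.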

    Proof of Eqn (27). Let's define the following notation: $\mathbf{x}_{0:T}=\{\mathbf{x}_0,\ldots,\mathbf{x}_T\}$ denotes the collection of all state variables from $t=0$ to $t=T$. We also recall that the prior distribution $p(\mathbf{x})$ is the distribution of the image $\mathbf{x}_0$, so it is equivalent to $p(\mathbf{x}_0)$. With these in mind, we can show that

$$\begin{aligned}
\log p(\mathbf{x}) &= \log p(\mathbf{x}_0)\\
&= \log \int p(\mathbf{x}_{0:T})\, d\mathbf{x}_{1:T} && \text{marginalize by integrating over } \mathbf{x}_{1:T}\\
&= \log \int p(\mathbf{x}_{0:T})\, \frac{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\, d\mathbf{x}_{1:T} && \text{multiply and divide by } q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)\\
&= \log \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] && \text{definition of expectation.}
\end{aligned}$$

Now, we need to use Jensen's inequality, which states that for any random variable $X$ and any concave function $f$, it holds that $f(\mathbb{E}[X]) \ge \mathbb{E}[f(X)]$. By recognizing that $f(\cdot)=\log(\cdot)$ is concave, we can show that

$$\log p(\mathbf{x}) = \log \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right] \ge \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]. \tag{28}$$

Let's take a closer look at $p(\mathbf{x}_{0:T})$. Inspecting Figure 8, we notice that if we want to decouple $p(\mathbf{x}_{0:T})$, we should do the conditioning for $\mathbf{x}_{t-1}|\mathbf{x}_t$. This leads to

$$p(\mathbf{x}_{0:T}) = p(\mathbf{x}_T)\prod_{t=1}^{T} p(\mathbf{x}_{t-1}|\mathbf{x}_t) = p(\mathbf{x}_T)\, p(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=2}^{T} p(\mathbf{x}_{t-1}|\mathbf{x}_t). \tag{29}$$

As for $q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)$, Figure 8 suggests that we need to do the conditioning for $\mathbf{x}_t|\mathbf{x}_{t-1}$. Because of the sequential relationship, we can write

$$q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{t=1}^{T} q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1}) = q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1} q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1}). \tag{30}$$

Substituting Eqn (29) and Eqn (30) back into Eqn (28), we can show that

$$\begin{aligned}
\log p(\mathbf{x}) &\ge \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right]\\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)\, p(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=2}^{T} p(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1} q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right]\\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)\, p(\mathbf{x}_0|\mathbf{x}_1)\prod_{t=1}^{T-1} p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\prod_{t=1}^{T-1} q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] && \text{shift } t \text{ to } t+1\\
&= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)\, p(\mathbf{x}_0|\mathbf{x}_1)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})}\right] + \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \prod_{t=1}^{T-1}\frac{p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] && \text{split expectation.}
\end{aligned}$$

The first term above can be further decomposed into two expectations:

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)\, p(\mathbf{x}_0|\mathbf{x}_1)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})}\right] = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\Big[\log p(\mathbf{x}_0|\mathbf{x}_1)\Big]}_{\text{Reconstruction}} + \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})}\right]}_{\text{Prior Matching}}.$$

The Reconstruction term can be simplified as

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\Big[\log p(\mathbf{x}_0|\mathbf{x}_1)\Big] = \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\Big[\log p(\mathbf{x}_0|\mathbf{x}_1)\Big],$$

where we used the fact that the integrand depends on $\mathbf{x}_{1:T}$ only through $\mathbf{x}_1$, so the conditioning $\mathbf{x}_{1:T}|\mathbf{x}_0$ is equivalent to $\mathbf{x}_1|\mathbf{x}_0$. The Prior Matching term is

$$\begin{aligned}
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})}\right] &= \mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1},\mathbf{x}_T|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_T)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})}\right]\\
&= -\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1},\mathbf{x}_T|\mathbf{x}_0)}\Big[\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})\,\|\, p(\mathbf{x}_T)\big)\Big],
\end{aligned}$$

where we notice that the conditional expectation can be simplified to samples $\mathbf{x}_T$ and $\mathbf{x}_{T-1}$ only, because $\log \frac{p(\mathbf{x}_T)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_{T-1})}$ only depends on $\mathbf{x}_T$ and $\mathbf{x}_{T-1}$. Finally, we look at the product term. We can show that

$$\begin{aligned}
\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \prod_{t=1}^{T-1}\frac{p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right] &= \sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right]\\
&= \sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_t,\mathbf{x}_{t+1}|\mathbf{x}_0)}\left[\log \frac{p(\mathbf{x}_t|\mathbf{x}_{t+1})}{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})}\right]\\
&= \underbrace{-\sum_{t=1}^{T-1}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)}\Big[\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})\,\|\, p(\mathbf{x}_t|\mathbf{x}_{t+1})\big)\Big]}_{\text{Consistency}}.
\end{aligned}$$

By replacing $p(\mathbf{x}_0|\mathbf{x}_1)$ with $p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$ and $p(\mathbf{x}_t|\mathbf{x}_{t+1})$ with $p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})$, we are done.
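Jensen's inequality does the heavy lifting in the proof above. Here is a quick numerical sanity check (not part of the tutorial) that $\log \mathbb{E}[X] \ge \mathbb{E}[\log X]$ for a positive random variable, since $\log$ is concave:

```python
# A quick sanity check (illustrative only) of Jensen's inequality for f = log:
# log E[X] >= E[log X] for any positive random variable X.
import numpy as np

rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)  # positive samples
print(np.log(x.mean()), ">=", np.log(x).mean())       # ~0.5 >= ~0.0
```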

    2.5 Rewrite the Consistency Term

    The nightmare of the variational diffusion model above is that we need to draw samples $(\mathbf{x}_{t-1},\mathbf{x}_{t+1})$ from the joint distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)$. We do not know what $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1},\mathbf{x}_{t+1}|\mathbf{x}_0)$ is! Well, of course it is a Gaussian, but we would still need to use the future sample $\mathbf{x}_{t+1}$ to draw the current sample $\mathbf{x}_t$. This is odd, and it is no fun.

    Inspecting the consistency term, we see that $q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1})$ and $p_{\boldsymbol{\theta}}(\mathbf{x}_t|\mathbf{x}_{t+1})$ move in two opposite directions. It is therefore unavoidable that we need to use both $\mathbf{x}_{t-1}$ and $\mathbf{x}_{t+1}$. The question we should ask is: can we come up with something so that we do not need to handle two opposite directions while still being able to check consistency?

    Well, here is a simple trick called Bayes' theorem:

$$q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t)\, q(\mathbf{x}_t)}{q(\mathbf{x}_{t-1})} \quad\overset{\text{condition on } \mathbf{x}_0}{\Longrightarrow}\quad q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0) = \frac{q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\, q(\mathbf{x}_t|\mathbf{x}_0)}{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}. \tag{31}$$

    With this change of the conditioning order, we can switch from $q(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)$ to $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ by adding one more conditioning variable $\mathbf{x}_0$. The direction $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ is now parallel to $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$, as shown in Figure 13. So, if we want to rewrite the consistency term, a natural option is to calculate the KL divergence between $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ and $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$.

    Figure 13: Using Bayes' theorem in Eqn (31), we can define a distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ whose direction is parallel to that of $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$.
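Eqn (31) is easy to verify numerically. The sketch below (illustrative only) builds a toy three-state Markov chain, treats its transition matrix as $q(\mathbf{x}_t|\mathbf{x}_{t-1})$, and checks that the conditioned Bayes relation holds exactly for $t=2$:

```python
# A small numerical check (illustrative only) of Eqn (31) on a toy 3-state
# Markov chain: q(x_2 | x_1, x_0) = q(x_1 | x_2, x_0) q(x_2 | x_0) / q(x_1 | x_0).
import numpy as np

rng = np.random.default_rng(0)
K = 3
P = rng.random((K, K))
P /= P.sum(axis=1, keepdims=True)          # P[i, j] = q(x_t = j | x_{t-1} = i)

x0 = 0
q_x1 = P[x0]                               # q(x_1 | x_0)
q_x2 = q_x1 @ P                            # q(x_2 | x_0)
joint = q_x1[:, None] * P                  # joint[i, j] = q(x_1 = i, x_2 = j | x_0)

post = joint / q_x2[None, :]               # post[i, j] = q(x_1 = i | x_2 = j, x_0)
rhs = post * q_x2[None, :] / q_x1[:, None] # right-hand side of Eqn (31)
print(np.allclose(rhs, P))                 # True: matches q(x_2 | x_1, x_0) = P
```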

    If we go through some (boring) algebraic derivations, we can show that the ELBO now takes a friendlier form. The ELBO for a variational diffusion model is

$$\text{ELBO}_{\boldsymbol{\phi},\boldsymbol{\theta}}(\mathbf{x}) = \underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\Big[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\Big]}_{\text{Reconstruction}} - \underbrace{\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{\text{Prior Matching}} - \underbrace{\sum_{t=2}^{T}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)}\Big[\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)\Big]}_{\text{Consistency}}. \tag{32}$$

  • Reconstruction. The new reconstruction term is the same as before. We are still maximizing the log-likelihood.

  • Prior Matching. The new prior matching term is simplified to the KL divergence between $q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)$ and $p(\mathbf{x}_T)$. The change is due to the fact that we now condition on $\mathbf{x}_0$. Hence, there is no need to draw samples from $q_{\boldsymbol{\phi}}(\mathbf{x}_{T-1}|\mathbf{x}_0)$ and take the expectation.

  • Consistency. The new consistency term differs from the previous one in two ways. First, the running index $t$ starts at $t=2$ and ends at $t=T$; previously it ran from $t=1$ to $t=T-1$. Second, the distribution matching is now between $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ and $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$. So, instead of asking a forward transition to match a reverse transition, we use $q_{\boldsymbol{\phi}}$ to construct a reverse transition and use it to match $p_{\boldsymbol{\theta}}$. (A small sketch of the Gaussian KL computation behind this term follows this list.)
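Since all the distributions in Eqn (32) are Gaussian, every KL divergence has a closed form. Below is a minimal sketch of that building block for diagonal Gaussians. The means and variances are placeholders for illustration; the actual parameters of $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ are derived later in the tutorial.

```python
# Closed-form KL between two diagonal Gaussians: the building block of the
# consistency term in Eqn (32). The means/variances below are placeholders,
# not the quantities derived in the tutorial.
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ), summed over dims."""
    return 0.5 * np.sum(np.log(var_p / var_q)
                        + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

d = 4
mu_q, var_q = np.zeros(d), np.full(d, 0.5)       # placeholder posterior q
mu_p, var_p = 0.1 * np.ones(d), np.full(d, 0.5)  # placeholder network output
print(kl_diag_gauss(mu_q, var_q, mu_p, var_p))   # small positive number
```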

  • Proof of Eqn (32). We begin with Eqn (28) by showing that logp(𝐱)𝑝𝐱\displaystyle\log p(\mathbf{x})roman_log italic_p ( bold_x ) 𝔼qϕ(𝐱1:T|𝐱0)[logp(𝐱0:T)qϕ(𝐱1:T|𝐱0)]absentsubscript𝔼subscript𝑞bold-italic-ϕconditionalsubscript𝐱:1𝑇subscript𝐱0delimited-[]𝑝subscript𝐱:0𝑇subscript𝑞bold-italic-ϕconditionalsubscript𝐱:1𝑇subscript𝐱0\displaystyle\geq\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}% _{0})}\left[\log\frac{p(\mathbf{x}_{0:T})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1% :T}|\mathbf{x}_{0})}\right]≥ blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT 0 : italic_T end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ] By Eqn (28) =𝔼qϕ(𝐱1:T|𝐱0)[logp(𝐱T)p(𝐱0|𝐱1)t=2Tp(𝐱t1|𝐱t)qϕ(𝐱1|𝐱0)t=2Tqϕ(𝐱t|𝐱t1,𝐱0)]absentsubscript𝔼subscript𝑞bold-italic-ϕconditionalsubscript𝐱:1𝑇subscript𝐱0delimited-[]𝑝subscript𝐱𝑇𝑝conditionalsubscript𝐱0subscript𝐱1superscriptsubscriptproduct𝑡2𝑇𝑝conditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝑞bold-italic-ϕconditionalsubscript𝐱1subscript𝐱0superscriptsubscriptproduct𝑡2𝑇subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐱0\displaystyle=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0% })}\left[\log\frac{p(\mathbf{x}_{T})p(\mathbf{x}_{0}|\mathbf{x}_{1})\prod_{t=2% }^{T}p(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{1}|% \mathbf{x}_{0})\prod_{t=2}^{T}q_{\boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_% {t-1},\mathbf{x}_{0})}\right]= blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_p ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ] split the chain 
=𝔼qϕ(𝐱1:T|𝐱0)[logp(𝐱T)p(𝐱0|𝐱1)qϕ(𝐱1|𝐱0)]+𝔼qϕ(𝐱1:T|𝐱0)[logt=2Tp(𝐱t1|𝐱t)qϕ(𝐱t|𝐱t1,𝐱0)]absentsubscript𝔼subscript𝑞bold-italic-ϕconditionalsubscript𝐱:1𝑇subscript𝐱0delimited-[]𝑝subscript𝐱𝑇𝑝conditionalsubscript𝐱0subscript𝐱1subscript𝑞bold-italic-ϕconditionalsubscript𝐱1subscript𝐱0subscript𝔼subscript𝑞bold-italic-ϕconditionalsubscript𝐱:1𝑇subscript𝐱0delimited-[]superscriptsubscriptproduct𝑡2𝑇𝑝conditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐱0\displaystyle=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0% })}\left[\log\frac{p(\mathbf{x}_{T})p(\mathbf{x}_{0}|\mathbf{x}_{1})}{q_{% \boldsymbol{\phi}}(\mathbf{x}_{1}|\mathbf{x}_{0})}\right]+\mathbb{E}_{q_{% \boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_{0})}\left[\log\prod_{t=2}^{T}% \frac{p(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t}% |\mathbf{x}_{t-1},\mathbf{x}_{0})}\right]= blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ roman_log divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT ) italic_p ( bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ] + blackboard_E start_POSTSUBSCRIPT italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 : italic_T end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT [ roman_log ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG ] (33) Let’s consider the second term: t=2Tp(𝐱t1|𝐱t)qϕ(𝐱t|𝐱t1,𝐱0)superscriptsubscriptproduct𝑡2𝑇𝑝conditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡subscript𝐱𝑡1subscript𝐱0\displaystyle\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q_{% \boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{t-1},\mathbf{x}_{0})}∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG =t=2Tp(𝐱t1|𝐱t)qϕ(𝐱t1|𝐱t,𝐱0)qϕ(𝐱t|𝐱0)qϕ(𝐱t1|𝐱0)absentsuperscriptsubscriptproduct𝑡2𝑇𝑝conditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝐱0subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡subscript𝐱0subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡1subscript𝐱0\displaystyle=\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{\frac{q% 
_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})q_{% \boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{0})}{q_{\boldsymbol{\phi}}(% \mathbf{x}_{t-1}|\mathbf{x}_{0})}}= ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG divide start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG end_ARG Bayes rule, Eqn (31) =t=2Tp(𝐱t1|𝐱t)qϕ(𝐱t1|𝐱t,𝐱0)×t=2Tqϕ(𝐱t1|𝐱0)qϕ(𝐱t|𝐱0)absentsuperscriptsubscriptproduct𝑡2𝑇𝑝conditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝐱0superscriptsubscriptproduct𝑡2𝑇subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡1subscript𝐱0subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡subscript𝐱0\displaystyle=\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q_{% \boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}\times\prod% _{t=2}^{T}\frac{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_{0})}{q_{% \boldsymbol{\phi}}(\mathbf{x}_{t}|\mathbf{x}_{0})}= ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG × ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG Rearrange denominator =t=2Tp(𝐱t1|𝐱t)qϕ(𝐱t1|𝐱t,𝐱0)×qϕ(𝐱1|𝐱0)qϕ(𝐱T|𝐱0),absentsuperscriptsubscriptproduct𝑡2𝑇𝑝conditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑡1subscript𝐱𝑡subscript𝐱0subscript𝑞bold-italic-ϕconditionalsubscript𝐱1subscript𝐱0subscript𝑞bold-italic-ϕconditionalsubscript𝐱𝑇subscript𝐱0\displaystyle=\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_{t})}{q_{% \boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_{t},\mathbf{x}_{0})}\times\frac% {q_{\boldsymbol{\phi}}(\mathbf{x}_{1}|\mathbf{x}_{0})}{q_{\boldsymbol{\phi}}(% \mathbf{x}_{T}|\mathbf{x}_{0})},= ∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_p ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) end_ARG start_ARG 
italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG × divide start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG start_ARG italic_q start_POSTSUBSCRIPT bold_italic_ϕ end_POSTSUBSCRIPT ( bold_x start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT | bold_x start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ) end_ARG , Recursion cancels terms where the last equation uses the fact that for any sequence a1,,aTsubscript𝑎1subscript𝑎𝑇a_{1},\ldots,a_{T}italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_a start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT, we have t=2Tat1at=a1a2×a2a3××aT1aT=a1aTsuperscriptsubscriptproduct𝑡2𝑇subscript𝑎𝑡1subscript𝑎𝑡subscript𝑎1subscript𝑎2subscript𝑎2subscript𝑎3subscript𝑎𝑇1subscript𝑎𝑇subscript𝑎1subscript𝑎𝑇\prod_{t=2}^{T}\frac{a_{t-1}}{a_{t}}=\frac{a_{1}}{a_{2}}\times\frac{a_{2}}{a_{% 3}}\times\ldots\times\frac{a_{T-1}}{a_{T}}=\frac{a_{1}}{a_{T}}∏ start_POSTSUBSCRIPT italic_t = 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_T end_POSTSUPERSCRIPT divide start_ARG italic_a start_POSTSUBSCRIPT italic_t - 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT end_ARG = divide start_ARG italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG × divide start_ARG italic_a start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG italic_a start_POSTSUBSCRIPT 3 end_POSTSUBSCRIPT end_ARG × … × divide start_ARG italic_a start_POSTSUBSCRIPT italic_T - 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_a start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG = divide start_ARG italic_a start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_ARG start_ARG italic_a start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT end_ARG. 
Going back to Eqn (33), we can see that

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)\,p(\mathbf{x}_0|\mathbf{x}_1)}{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\right]+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)}\right]$$
$$=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)\,p(\mathbf{x}_0|\mathbf{x}_1)}{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}+\log\frac{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)}\right]+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right]$$
$$=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)\,p(\mathbf{x}_0|\mathbf{x}_1)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)}\right]+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right],$$

where we canceled $q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)$ in the numerator and denominator since $\log\frac{a}{b}+\log\frac{b}{c}=\log\frac{a}{c}$ for any positive constants $a$, $b$, and $c$. The first of the two remaining terms gives us

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)\,p(\mathbf{x}_0|\mathbf{x}_1)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)}\right]=\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log p(\mathbf{x}_0|\mathbf{x}_1)\right]+\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_T)}{q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)}\right]$$
$$=\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\left[\log p(\mathbf{x}_0|\mathbf{x}_1)\right]}_{\text{reconstruction}}-\underbrace{\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{\text{prior matching}}.$$

The last term is

$$\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log\prod_{t=2}^{T}\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right]=\sum_{t=2}^{T}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_t,\mathbf{x}_{t-1}|\mathbf{x}_0)}\left[\log\frac{p(\mathbf{x}_{t-1}|\mathbf{x}_t)}{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}\right]$$
$$=-\underbrace{\sum_{t=2}^{T}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)}\,\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)}_{\text{consistency}}.$$

Finally, replace $p(\mathbf{x}_{t-1}|\mathbf{x}_t)$ by $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$, and $p(\mathbf{x}_0|\mathbf{x}_1)$ by $p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$. Done!
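For bookkeeping, here is the decomposition we have just derived, collected in one line with the three terms labeled:

$$\text{ELBO}_{\boldsymbol{\theta},\boldsymbol{\phi}}(\mathbf{x}_0)=\underbrace{\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_1|\mathbf{x}_0)}\left[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)\right]}_{\text{reconstruction}}-\underbrace{\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{\text{prior matching}}-\underbrace{\sum_{t=2}^{T}\mathbb{E}_{q_{\boldsymbol{\phi}}(\mathbf{x}_t|\mathbf{x}_0)}\,\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\|\,p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)}_{\text{consistency}}.$$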

2.6 Derivation of $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$

Now that we have the new ELBO for the variational diffusion model, we should spend some time on its key component $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$. In a nutshell, we want to show that

• $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ is not as wild as you might think. It is still a Gaussian.

• Since it is a Gaussian, it is fully characterized by its mean and covariance. It turns out that

$$q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1}\,|\,\heartsuit\,\mathbf{x}_t+\spadesuit\,\mathbf{x}_0,\;\clubsuit\,\mathbf{I}), \qquad (34)$$

for some magical scalars $\heartsuit$, $\spadesuit$ and $\clubsuit$ defined below.

The distribution $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ takes the form of

$$q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1}\,|\,\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0),\,\boldsymbol{\Sigma}_q(t)), \qquad (35)$$

where

$$\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)=\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\mathbf{x}_t+\frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\mathbf{x}_0, \qquad (36)$$

$$\boldsymbol{\Sigma}_q(t)=\frac{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}\mathbf{I}\;\overset{\text{def}}{=}\;\sigma_q^2(t)\,\mathbf{I}. \qquad (37)$$
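To make Eqns (36)-(37) concrete, here is a minimal numpy sketch that evaluates $\boldsymbol{\mu}_q$ and $\sigma_q^2(t)$ for a given pair $(\mathbf{x}_t,\mathbf{x}_0)$. The linear schedule $\beta_t\in[10^{-4},0.02]$ and the names `betas`, `alphas`, `alpha_bar` are illustrative choices, not prescribed by the text.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # illustrative noise schedule beta_t, t = 1..T
alphas = 1.0 - betas                 # alpha_t
alpha_bar = np.cumprod(alphas)       # alpha_bar_t = prod_{i<=t} alpha_i

def posterior_mean_var(x_t, x_0, t):
    """mu_q(x_t, x_0) and sigma_q^2(t) of q(x_{t-1} | x_t, x_0), Eqns (36)-(37).
    t is 1-indexed, so alpha_bar[t-1] stores alpha_bar_t; alpha_bar_0 = 1."""
    a_t, ab_t = alphas[t - 1], alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0
    coef_xt = (1 - ab_prev) * np.sqrt(a_t) / (1 - ab_t)
    coef_x0 = (1 - a_t) * np.sqrt(ab_prev) / (1 - ab_t)
    var = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)
    return coef_xt * x_t + coef_x0 * x_0, var
```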

The interesting part of Eqn (35) is that $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ is completely characterized by $\mathbf{x}_t$ and $\mathbf{x}_0$. No neural network is needed to estimate its mean and variance! (Compare this with the VAE, where a network is required.) Since no network is needed, there is really nothing to "learn" here: $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ is automatically determined once we know $\mathbf{x}_t$ and $\mathbf{x}_0$. No guessing, no estimation, nothing.

This realization is important. If we look at the consistency term, it is a sum of many KL-divergence terms, where the $t$-th term is

$$\mathbb{D}_{\text{KL}}\big(\underbrace{q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)}_{\text{nothing to learn}}\,\big\|\,\underbrace{p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)}_{\text{need to do something}}\big). \qquad (38)$$

As we just said, there is nothing to be done about $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$. But we do need to do something to $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ so that we can compute the KL divergence.

So what shall we do? We know that $q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ is a Gaussian. If we want the KL divergence to have a quick closed form, we clearly need to assume that $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is also a Gaussian. Yes, really: we cannot fully justify why it should be a Gaussian, but since $p_{\boldsymbol{\theta}}$ is a distribution of our choosing, we might as well choose one that is easy to work with. To this end, we pick

$$p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}\big(\mathbf{x}_{t-1}\,\big|\,\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)}_{\text{neural network}},\;\sigma_q^2(t)\,\mathbf{I}\big), \qquad (39)$$

where we assume that the mean vector is computed by a neural network, and we choose the variance to be $\sigma_q^2(t)$, identical to the variance in Eqn (37). Putting Eqn (35) side by side with $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$, we notice the parallel between the two:

$$q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\mathcal{N}\big(\mathbf{x}_{t-1}\,\big|\,\underbrace{\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)}_{\text{known}},\;\underbrace{\sigma_q^2(t)\,\mathbf{I}}_{\text{known}}\big), \qquad (40)$$

$$p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)=\mathcal{N}\big(\mathbf{x}_{t-1}\,\big|\,\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)}_{\text{neural network}},\;\underbrace{\sigma_q^2(t)\,\mathbf{I}}_{\text{known}}\big). \qquad (41)$$

Consequently, the KL divergence simplifies to

$$\mathbb{D}_{\text{KL}}\big(q_{\boldsymbol{\phi}}(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\,\big\|\,p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)$$
$$=\mathbb{D}_{\text{KL}}\big(\mathcal{N}(\mathbf{x}_{t-1}\,|\,\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0),\sigma_q^2(t)\mathbf{I})\,\big\|\,\mathcal{N}(\mathbf{x}_{t-1}\,|\,\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t),\sigma_q^2(t)\mathbf{I})\big)$$
$$=\frac{1}{2\sigma_q^2(t)}\,\big\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\big\|^2, \qquad (42)$$

where we used the fact that the KL divergence between two Gaussians with the same covariance $\sigma^2\mathbf{I}$ is the squared Euclidean distance between the two mean vectors, scaled by $\frac{1}{2\sigma^2}$.
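This Gaussian KL identity is easy to sanity-check numerically. The following sketch (our own illustration, not from the text) compares a Monte Carlo estimate of $\mathbb{E}_{p_1}[\log p_1(\mathbf{x})-\log p_2(\mathbf{x})]$ against the closed form $\frac{1}{2\sigma^2}\|\boldsymbol{\mu}_1-\boldsymbol{\mu}_2\|^2$; the two printed numbers should agree to a few decimal places.

```python
import numpy as np

# KL between N(mu1, s^2 I) and N(mu2, s^2 I): the normalizing constants cancel,
# so log p1(x) - log p2(x) = (||x - mu2||^2 - ||x - mu1||^2) / (2 s^2).
rng = np.random.default_rng(0)
d, s = 4, 0.7
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
x = mu1 + s * rng.normal(size=(200_000, d))       # samples from the first Gaussian
log_ratio = (((x - mu2)**2).sum(axis=1) - ((x - mu1)**2).sum(axis=1)) / (2 * s**2)
print(log_ratio.mean())                           # Monte Carlo estimate of the KL
print(np.sum((mu1 - mu2)**2) / (2 * s**2))        # closed form used in Eqn (42)
```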

Going back to the definition of the ELBO in Eqn (32), we can rewrite it as

$$\text{ELBO}_{\boldsymbol{\theta}}(\mathbf{x})=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)]-\underbrace{\mathbb{D}_{\text{KL}}\big(q(\mathbf{x}_T|\mathbf{x}_0)\,\|\,p(\mathbf{x}_T)\big)}_{\text{nothing to train}}-\sum_{t=2}^{T}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\,\big\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\big\|^2\Big]. \qquad (43)$$

A few observations are worth making:

• We dropped all the subscripts $\boldsymbol{\phi}$ because $q$ is completely determined once $\mathbf{x}_0$ is known; we are merely adding (different levels of) white noise to obtain each of $\mathbf{x}_1,\ldots,\mathbf{x}_T$. This leaves us with an ELBO that only needs to be optimized over $\boldsymbol{\theta}$.

• The parameters $\boldsymbol{\theta}$ are realized through the network $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)$: they are the weights of $\boldsymbol{\mu}_{\boldsymbol{\theta}}$.

• The sampling from $q(\mathbf{x}_t|\mathbf{x}_0)$ is done according to Eqn (21), which states that $q(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t\,|\,\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,\,(1-\overline{\alpha}_t)\mathbf{I})$. (A short sketch of this sampling step follows this list.)

• Given $\mathbf{x}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)$, we can compute $\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)$, which is just $\log\mathcal{N}(\mathbf{x}_0\,|\,\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1),\sigma_q^2(1)\mathbf{I})$. So as soon as we know $\mathbf{x}_1$, we can feed it to the network $\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1)$ to obtain a mean estimate, which is then used to compute the likelihood.
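As promised, here is a minimal sketch of the sampling step $\mathbf{x}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)$ via the reparameterization $\mathbf{x}_t=\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0+\sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I})$. The schedule is the same illustrative one as before.

```python
import numpy as np

T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))  # illustrative schedule

def sample_xt(x0, t, rng=np.random.default_rng(0)):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(ab_t) x_0, (1 - ab_t) I), per Eqn (21)."""
    ab_t = alpha_bar[t - 1]                                # t is 1-indexed
    eps = rng.normal(size=np.shape(x0))                    # white noise
    return np.sqrt(ab_t) * np.asarray(x0) + np.sqrt(1 - ab_t) * eps
```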

Before we go further, let us complete the story by discussing how Eqn (35) is determined.

Proof of Eqn (35). Using the Bayes theorem stated in Eqn (31), $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ can be determined if we evaluate the following product of Gaussians:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)=\frac{\mathcal{N}(\mathbf{x}_t\,|\,\sqrt{\alpha_t}\,\mathbf{x}_{t-1},(1-\alpha_t)\mathbf{I})\;\mathcal{N}(\mathbf{x}_{t-1}\,|\,\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_0,(1-\overline{\alpha}_{t-1})\mathbf{I})}{\mathcal{N}(\mathbf{x}_t\,|\,\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,(1-\overline{\alpha}_t)\mathbf{I})}. \qquad (44)$$

For simplicity we will treat the vectors as scalars. The product of Gaussians then becomes

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)\propto\exp\left\{-\left[\frac{(\mathbf{x}_t-\sqrt{\alpha_t}\,\mathbf{x}_{t-1})^2}{2(1-\alpha_t)}+\frac{(\mathbf{x}_{t-1}-\sqrt{\overline{\alpha}_{t-1}}\,\mathbf{x}_0)^2}{2(1-\overline{\alpha}_{t-1})}-\frac{(\mathbf{x}_t-\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0)^2}{2(1-\overline{\alpha}_t)}\right]\right\}. \qquad (45)$$

We consider the following mapping:

$$x=\mathbf{x}_t,\qquad y=\mathbf{x}_{t-1},\qquad z=\mathbf{x}_0,\qquad a=\alpha_t,\qquad b=\overline{\alpha}_{t-1},\qquad c=\overline{\alpha}_t.$$

Consider the quadratic function

$$f(y)=\frac{(x-\sqrt{a}\,y)^2}{2(1-a)}+\frac{(y-\sqrt{b}\,z)^2}{2(1-b)}-\frac{(x-\sqrt{c}\,z)^2}{2(1-c)}. \qquad (46)$$

We know that no matter how we rearrange the terms, the resulting function remains quadratic in $y$, and the minimizer of $f(y)$ is the mean of the resulting Gaussian (note that the last term does not depend on $y$). So we can calculate the derivative of $f$ and show that

$$f'(y)=\frac{1-ab}{(1-a)(1-b)}\,y-\left(\frac{\sqrt{a}}{1-a}\,x+\frac{\sqrt{b}}{1-b}\,z\right).$$

Setting $f'(y)=0$ yields

$$y=\frac{(1-b)\sqrt{a}}{1-ab}\,x+\frac{(1-a)\sqrt{b}}{1-ab}\,z. \qquad (47)$$

We note that $ab=\alpha_t\overline{\alpha}_{t-1}=\overline{\alpha}_t$. So,

$$\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)=\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\,\mathbf{x}_t+\frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\,\mathbf{x}_0. \qquad (48)$$

Similarly, for the variance we can check the curvature $f''(y)$. We can easily show that

$$f''(y)=\frac{1-ab}{(1-a)(1-b)}=\frac{1-\overline{\alpha}_t}{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}.$$

Taking the reciprocal gives us

$$\boldsymbol{\Sigma}_q(t)=\frac{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}\,\mathbf{I}. \qquad (49)$$
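If you distrust the completing-the-square algebra, the scalar case is cheap to verify numerically. The sketch below (our own check, with arbitrary stand-in values for $a$, $b$, $x$, $z$) minimizes $f(y)$ on a grid and compares the minimizer against Eqn (47), and the reciprocal curvature against Eqn (49).

```python
import numpy as np

a, b = 0.98, 0.90        # stand-ins for alpha_t and alpha_bar_{t-1}
x, z = 1.3, -0.4         # stand-ins for x_t and x_0

# The third term of Eqn (46) does not depend on y, so we drop it.
f = lambda y: (x - np.sqrt(a)*y)**2 / (2*(1 - a)) + (y - np.sqrt(b)*z)**2 / (2*(1 - b))

ys = np.linspace(-3.0, 3.0, 600_001)
y_star = ys[np.argmin(f(ys))]                                            # grid minimizer
y_closed = (1-b)*np.sqrt(a)/(1 - a*b)*x + (1-a)*np.sqrt(b)/(1 - a*b)*z   # Eqn (47)
var_closed = (1 - a)*(1 - b)/(1 - a*b)                                   # Eqn (49), ab = alpha_bar_t
print(y_star, y_closed)   # should agree up to the grid resolution
print(var_closed)         # sigma_q^2(t) for these stand-in values
```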

    2.7 Training and Inference

The ELBO in Eqn (43) suggests that we need to find a network $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ that can somehow minimize the loss:

$$\frac{1}{2\sigma_q^2(t)}\,\Big\|\underbrace{\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)}_{\text{known}}-\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)}_{\text{network}}\Big\|^2. \qquad (50)$$

But where does the concept of "denoising" come from?

To see this, recall from Eqn (36) that

$$\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)=\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\,\mathbf{x}_t+\frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\,\mathbf{x}_0. \qquad (51)$$

Since $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ is our design, there is no reason why we cannot define it as something more convenient. So here is an option:

$$\underbrace{\boldsymbol{\mu}_{\boldsymbol{\theta}}}_{\text{a network}}(\mathbf{x}_t)\;\overset{\text{def}}{=}\;\frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\,\mathbf{x}_t+\frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\,\underbrace{\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)}_{\text{another network}}. \qquad (52)$$
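In code, Eqn (52) is a thin deterministic wrapper around the denoiser. Below is a sketch under the same illustrative schedule as before; `denoiser` is a hypothetical stand-in for the network $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}$.

```python
import numpy as np

def mu_theta(x_t, t, denoiser, alphas, alpha_bar):
    """Eqn (52): mu_theta(x_t) built from a denoiser x0hat = x_hat_theta(x_t).
    `denoiser(x_t, t)` is a placeholder for the trained network."""
    a_t, ab_t = alphas[t - 1], alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2] if t > 1 else 1.0       # alpha_bar_0 = 1
    x0hat = denoiser(x_t, t)                           # "another network"
    return ((1 - ab_prev) * np.sqrt(a_t) / (1 - ab_t)) * x_t \
         + ((1 - a_t) * np.sqrt(ab_prev) / (1 - ab_t)) * x0hat
```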

Substituting Eqn (51) and Eqn (52) into Eqn (50), we obtain

$$\frac{1}{2\sigma_q^2(t)}\,\big\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\big\|^2=\frac{1}{2\sigma_q^2(t)}\,\left\|\frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\big(\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\big)\right\|^2=\frac{1}{2\sigma_q^2(t)}\,\frac{(1-\alpha_t)^2\,\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\,\big\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\big\|^2.$$

Therefore, the ELBO can be simplified into

$$\text{ELBO}_{\boldsymbol{\theta}}=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)]-\sum_{t=2}^{T}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\,\big\|\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)-\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t)\big\|^2\Big]$$
$$=\mathbb{E}_{q(\mathbf{x}_1|\mathbf{x}_0)}[\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)]-\sum_{t=2}^{T}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\,\frac{(1-\alpha_t)^2\,\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\,\big\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\big\|^2\Big]. \qquad (53)$$
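The consistency part of Eqn (53) translates almost line by line into a stochastic training objective: draw a random $t$, draw $\mathbf{x}_t\sim q(\mathbf{x}_t|\mathbf{x}_0)$, and penalize the weighted error between $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)$ and $\mathbf{x}_0$. The sketch below is our own illustration of one such loss evaluation; `denoiser`, the schedule, and the uniform choice of $t$ are assumptions, and a real implementation would backpropagate through the network rather than use numpy.

```python
import numpy as np

def denoising_loss(x0, denoiser, alphas, alpha_bar, rng):
    """One Monte Carlo sample of the consistency sum in Eqn (53)."""
    T = len(alphas)
    t = int(rng.integers(2, T + 1))                       # pick a term t in {2, ..., T}
    a_t, ab_t = alphas[t - 1], alpha_bar[t - 1]
    ab_prev = alpha_bar[t - 2]
    sigma2 = (1 - a_t) * (1 - ab_prev) / (1 - ab_t)       # sigma_q^2(t), Eqn (37)
    eps = rng.normal(size=x0.shape)
    x_t = np.sqrt(ab_t) * x0 + np.sqrt(1 - ab_t) * eps    # x_t ~ q(x_t | x_0), Eqn (21)
    w = (1 - a_t)**2 * ab_prev / (2 * sigma2 * (1 - ab_t)**2)   # weight in Eqn (53)
    return w * np.sum((denoiser(x_t, t) - x0)**2)
```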

The first term is

$$
\begin{aligned}
\log p_{\boldsymbol{\theta}}(\mathbf{x}_0|\mathbf{x}_1)
&= \log \mathcal{N}\big(\mathbf{x}_0\,|\,\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1),\,\sigma_q^2(1)\mathbf{I}\big)
 \propto -\frac{1}{2\sigma_q^2(1)}\big\|\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_1)-\mathbf{x}_0\big\|^2 && \text{definition}\\
&= -\frac{1}{2\sigma_q^2(1)}\left\|\frac{(1-\overline{\alpha}_0)\sqrt{\alpha_1}}{1-\overline{\alpha}_1}\mathbf{x}_1 + \frac{(1-\alpha_1)\sqrt{\overline{\alpha}_0}}{1-\overline{\alpha}_1}\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_1) - \mathbf{x}_0\right\|^2 && \text{recall } \alpha_0 = 1\\
&= -\frac{1}{2\sigma_q^2(1)}\left\|\frac{(1-\alpha_1)}{1-\overline{\alpha}_1}\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_1) - \mathbf{x}_0\right\|^2
 = -\frac{1}{2\sigma_q^2(1)}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_1) - \mathbf{x}_0\right\|^2 && \text{recall } \overline{\alpha}_1 = \alpha_1 \qquad (54)
\end{aligned}
$$

Substituting Eqn (54) into Eqn (53), the ELBO simplifies to

$$
\text{ELBO}_{\boldsymbol{\theta}} = -\sum_{t=1}^{T}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2\Big].
$$

Therefore, training the neural network boils down to minimizing a simple loss function. This is the loss function of a denoising diffusion probabilistic model (maximizing the ELBO is the same as minimizing its negation):

$$
\boldsymbol{\theta}^* = \mathop{\mathrm{argmin}}_{\boldsymbol{\theta}} \sum_{t=1}^{T} \frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\, \mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2\Big]. \qquad (55)
$$

The loss function defined in Eqn (55) is very intuitive. Ignoring the constants and the expectation, the main subject of interest, for a particular $\mathbf{x}_t$, is

$$
\mathop{\mathrm{argmin}}_{\boldsymbol{\theta}}\;\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2.
$$

This is nothing but a denoising problem, because we need to find a network $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}$ such that the denoised image $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)$ is close to the ground truth $\mathbf{x}_0$. What makes it different from a typical denoiser:

• $\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}$: We are not trying to denoise an arbitrary random noisy image. Instead, we are carefully choosing the noisy image to be

$$
\begin{aligned}
\mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_0) &= \mathcal{N}\big(\mathbf{x}_t\,|\,\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,\;(1-\overline{\alpha}_t)\mathbf{I}\big)\\
&= \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(0,\mathbf{I}).
\end{aligned}
$$

      여기서, "조심스럽다"는 것은 우리가 이미지에 주입하는 잡음의 양이 주의 깊게 조절된다는 것을 의미했다.

Figure 14: Forward sampling process. The forward sampling process is originally a sequence of operations. However, if we assume Gaussians, the sampling process can be simplified to one-step data generation.
• $\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}$: We do not weight the denoising loss equally for all steps. Instead, there is a scheduler that controls the relative emphasis on each denoising loss. However, for simplicity, we can drop these weights; the impact is minor.

• $\sum_{t=1}^{T}$: The summation can be replaced by a uniform distribution $t \sim \text{Uniform}[1,T]$.

Training a Denoising Diffusion Probabilistic Model. (Version: Predict image) For every image $\mathbf{x}_0$ in your training dataset, repeat the following steps until convergence.
- Pick a random time stamp $t \sim \text{Uniform}[1,T]$.
- Draw a sample $\mathbf{x}_t \sim \mathcal{N}\big(\mathbf{x}_t\,|\,\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,\;(1-\overline{\alpha}_t)\mathbf{I}\big)$, i.e.,
$$\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(0,\mathbf{I}).$$
- Take a gradient descent step on $\nabla_{\boldsymbol{\theta}}\left\|\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}_0\right\|^2$.
You can do this in batches, just like how you train any other neural network. Note that, here, you are training one denoising network $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}$ for all noisy conditions. A minimal code sketch follows Figure 15.
Figure 15: Training of a denoising diffusion probabilistic model. For the same neural network $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}$, we send noisy inputs $\mathbf{x}_t$ to the network. The gradient of the loss is back-propagated to update the network. Note that the noisy images are not arbitrary; they are generated according to the forward sampling process.
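To make the recipe concrete, here is a minimal PyTorch sketch of one training step of the predict-image version. The network `denoiser`, its `(x_t, t)` calling convention, and the linear `beta` schedule are illustrative assumptions, not something prescribed by the text.

```python
import torch

# Forward-process schedule (a common linear choice; any monotone schedule works).
T = 1000
beta = torch.linspace(1e-4, 0.02, T)
alpha = 1.0 - beta
alpha_bar = torch.cumprod(alpha, dim=0)      # \bar{alpha}_t

def training_step(denoiser, x0, optimizer):
    """One stochastic gradient step on || x_hat_theta(x_t) - x_0 ||^2."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))            # t ~ Uniform[1, T] (0-indexed here)
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    z = torch.randn_like(x0)                 # z ~ N(0, I)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * z   # one-step forward sampling
    x0_hat = denoiser(x_t, t)                # one network for all noise levels
    loss = ((x0_hat - x0) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```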

Once the denoiser $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}$ is trained, we can apply it to do inference. The inference is about sampling images from the distributions $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ over the sequence of states $\mathbf{x}_T, \mathbf{x}_{T-1}, \ldots, \mathbf{x}_1$. Since it is the reverse diffusion process, we need to do it recursively:

$$
\begin{aligned}
\mathbf{x}_{t-1} \sim p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}\,|\,\mathbf{x}_t)
&= \mathcal{N}\big(\mathbf{x}_{t-1}\,|\,\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t),\,\sigma_q^2(t)\mathbf{I}\big)\\
&= \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t) + \sigma_q(t)\,\mathbf{z}, \qquad \text{where}\;\; \mathbf{z}\sim\mathcal{N}(0,\mathbf{I})\\
&= \frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\mathbf{x}_t + \frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t) + \sigma_q(t)\,\mathbf{z}.
\end{aligned}
$$

This leads to the following inference algorithm.

Inference on a Denoising Diffusion Probabilistic Model. (Version: Predict image) You give us a white noise vector $\mathbf{x}_T \sim \mathcal{N}(0,\mathbf{I})$. Repeat the following for $t = T, T-1, \ldots, 1$.
- Calculate $\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t)$ using our trained denoiser.
- Update according to
$$\mathbf{x}_{t-1} = \frac{(1-\overline{\alpha}_{t-1})\sqrt{\alpha_t}}{1-\overline{\alpha}_t}\mathbf{x}_t + \frac{(1-\alpha_t)\sqrt{\overline{\alpha}_{t-1}}}{1-\overline{\alpha}_t}\widehat{\mathbf{x}}_{\boldsymbol{\theta}}(\mathbf{x}_t) + \sigma_q(t)\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(0,\mathbf{I}). \qquad (56)$$
Figure 16: Inference of a denoising diffusion probabilistic model.
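Below is a minimal sampling loop corresponding to the box above, reusing the schedule from the earlier training sketch. It assumes the posterior variance $\sigma_q^2(t) = \frac{(1-\alpha_t)(1-\overline{\alpha}_{t-1})}{1-\overline{\alpha}_t}$ derived earlier in the tutorial; this is a sketch, not a tuned implementation.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, shape):
    """Reverse diffusion per Eqn (56): from white noise x_T down to x_0."""
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):                       # t = T, ..., 1 (0-indexed)
        a_t, ab_t = alpha[t], alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        x0_hat = denoiser(x, torch.full((shape[0],), t))
        mean = ((1 - ab_prev) * a_t.sqrt() / (1 - ab_t)) * x \
             + ((1 - a_t) * ab_prev.sqrt() / (1 - ab_t)) * x0_hat
        sigma = (((1 - a_t) * (1 - ab_prev)) / (1 - ab_t)).sqrt()
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at last step
        x = mean + sigma * z
    return x
```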

    2.8 Derivation based on Noise Vector

If you are familiar with the denoising literature, you may be aware of residue-type algorithms that predict the noise instead of the signal. The same spirit applies to denoising diffusion: we can learn to predict the noise. To see why this is the case, we consider Eqn (24). Rearranging the terms, we obtain

$$
\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}_0
\quad\Rightarrow\quad
\mathbf{x}_0 = \frac{\mathbf{x}_t - \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}_0}{\sqrt{\overline{\alpha}_t}}.
$$

Substituting this into $\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)$, we obtain

$$
\begin{aligned}
\boldsymbol{\mu}_q(\mathbf{x}_t,\mathbf{x}_0)
&= \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\overline{\alpha}_{t-1}}(1-\alpha_t)\mathbf{x}_0}{1-\overline{\alpha}_t}\\
&= \frac{\sqrt{\alpha_t}(1-\overline{\alpha}_{t-1})\mathbf{x}_t + \sqrt{\overline{\alpha}_{t-1}}(1-\alpha_t)\cdot\frac{\mathbf{x}_t-\sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}_0}{\sqrt{\overline{\alpha}_t}}}{1-\overline{\alpha}_t}\\
&= \text{a few more algebraic steps}\\
&= \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}\sqrt{\alpha_t}}\boldsymbol{\epsilon}_0. \qquad (57)
\end{aligned}
$$

Therefore, we are free to design our mean estimator $\boldsymbol{\mu}_{\boldsymbol{\theta}}$ to match this form:

$$
\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t) = \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}\sqrt{\alpha_t}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t). \qquad (58)
$$

Substituting Eqn (57) and Eqn (58) into Eqn (50) gives us a new ELBO:

$$
\text{ELBO}_{\boldsymbol{\theta}} = -\sum_{t=1}^{T}\mathbb{E}_{q(\mathbf{x}_t|\mathbf{x}_0)}\Big[\frac{1}{2\sigma_q^2(t)}\frac{(1-\alpha_t)^2\overline{\alpha}_{t-1}}{(1-\overline{\alpha}_t)^2}\left\|\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\boldsymbol{\epsilon}_0\right\|^2\Big].
$$

Therefore, given $\mathbf{x}_t$, the network will return the predicted noise $\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)$. This gives us an alternative training scheme.

Training a Denoising Diffusion Probabilistic Model. (Version: Predict noise) For every image $\mathbf{x}_0$ in your training dataset, repeat the following steps until convergence.
- Pick a random time stamp $t \sim \text{Uniform}[1,T]$.
- Draw a sample $\mathbf{x}_t \sim \mathcal{N}\big(\mathbf{x}_t\,|\,\sqrt{\overline{\alpha}_t}\,\mathbf{x}_0,\;(1-\overline{\alpha}_t)\mathbf{I}\big)$, i.e.,
$$\mathbf{x}_t = \sqrt{\overline{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\overline{\alpha}_t}\,\boldsymbol{\epsilon}_0, \qquad \boldsymbol{\epsilon}_0\sim\mathcal{N}(0,\mathbf{I}).$$
- Take a gradient descent step on $\nabla_{\boldsymbol{\theta}}\left\|\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\boldsymbol{\epsilon}_0\right\|^2$.

For inference, substituting Eqn (58) into the posterior sampling step gives

$$
\begin{aligned}
\mathbf{x}_{t-1} \sim p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}\,|\,\mathbf{x}_t)
&= \mathcal{N}\big(\mathbf{x}_{t-1}\,|\,\boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t),\,\sigma_q^2(t)\mathbf{I}\big)\\
&= \boldsymbol{\mu}_{\boldsymbol{\theta}}(\mathbf{x}_t) + \sigma_q(t)\,\mathbf{z}\\
&= \frac{1}{\sqrt{\alpha_t}}\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}\sqrt{\alpha_t}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t) + \sigma_q(t)\,\mathbf{z}\\
&= \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)\right) + \sigma_q(t)\,\mathbf{z}.
\end{aligned}
$$

Summarizing it here, we have:

Inference on a Denoising Diffusion Probabilistic Model. (Version: Predict noise) You give us a white noise vector $\mathbf{x}_T \sim \mathcal{N}(0,\mathbf{I})$. Repeat the following for $t = T, T-1, \ldots, 1$.
- Calculate $\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)$ using our trained network.
- Update according to
$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\overline{\alpha}_t}}\widehat{\boldsymbol{\epsilon}}_{\boldsymbol{\theta}}(\mathbf{x}_t)\right) + \sigma_q(t)\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(0,\mathbf{I}).$$
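For concreteness, the following sketch implements the predict-noise training step and the corresponding sampling loop, reusing `alpha`, `alpha_bar`, and `T` from the earlier predict-image sketch. The network `eps_net` and its `(x_t, t)` signature are illustrative assumptions.

```python
import torch

def training_step_eps(eps_net, x0, optimizer):
    """One gradient step on || eps_hat_theta(x_t) - eps_0 ||^2."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alpha_bar[t].view(b, 1, 1, 1)
    eps = torch.randn_like(x0)                          # eps_0 ~ N(0, I)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    loss = ((eps_net(x_t, t) - eps) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def ddpm_sample_eps(eps_net, shape):
    """Reverse update: x_{t-1} = (x_t - (1-a_t)/sqrt(1-ab_t) eps_hat)/sqrt(a_t) + sigma z."""
    x = torch.randn(shape)
    for t in range(T - 1, -1, -1):
        a_t, ab_t = alpha[t], alpha_bar[t]
        ab_prev = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
        eps_hat = eps_net(x, torch.full((shape[0],), t))
        x = (x - (1 - a_t) / (1 - ab_t).sqrt() * eps_hat) / a_t.sqrt()
        sigma = (((1 - a_t) * (1 - ab_prev)) / (1 - ab_t)).sqrt()
        if t > 0:
            x = x + sigma * torch.randn_like(x)
    return x
```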

    2.9 Inversion by Direct Denoising (InDI)

If we look at the DDPM equations, we see that the update Eqn (56) takes the form

$$
\mathbf{x}_{t-1} = \big(\text{something}\big)\cdot\mathbf{x}_t + \big(\text{something else}\big)\cdot\text{denoise}(\mathbf{x}_t) + \text{noise}. \qquad (59)
$$

In other words, the $(t-1)$-th estimate is a linear combination of three terms: the current estimate $\mathbf{x}_t$, a denoised version $\text{denoise}(\mathbf{x}_t)$, and a noise term. The current estimate and the noise term are easy to understand. But what is "denoise"? An interesting paper by Delbracio and Milanfar [6] looked at generative diffusion models from the perspective of pure denoising. As it turns out, this remarkably simple perspective is consistent with other more advanced diffusion models in several good ways.

What is $\text{denoise}(\mathbf{x}_t)$? Denoising is a generic procedure that removes noise from a noisy image. In the good old days of statistical signal processing, a standard textbook problem was to derive the optimal denoiser for white noise. Given the observation model

$$
\mathbf{y} = \mathbf{x} + \boldsymbol{\epsilon}, \qquad \text{where}\;\;\boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I}),
$$

can we construct an estimator $g(\cdot)$ such that the mean squared error is minimized?

We will skip the derivation of the solution to this classical problem, since it can be found in any probability textbook, e.g., [7, Chapter 8]. The solution is

$$
\begin{aligned}
\text{denoise}(\mathbf{y}) &= \mathop{\mathrm{argmin}}_{g}\;\mathbb{E}_{\mathbf{x},\mathbf{y}}\big[\|g(\mathbf{y})-\mathbf{x}\|^2\big]\\
&= \text{some magical step}\\
&= \mathbb{E}[\mathbf{x}|\mathbf{y}]. \qquad (60)
\end{aligned}
$$

So, back to our problem: if we assume that

$$
\mathbf{x}_t = \mathbf{x}_{t-1} + \boldsymbol{\epsilon}_{t-1}, \qquad \text{where}\;\;\boldsymbol{\epsilon}_{t-1}\sim\mathcal{N}(0,\mathbf{I}),
$$

then clearly the denoiser is the conditional expectation of the posterior distribution:

$$
\text{denoise}(\mathbf{x}_t) = \mathbb{E}[\mathbf{x}_{t-1}|\mathbf{x}_t]. \qquad (61)
$$

Therefore, given the distribution $p_{\boldsymbol{\theta}}(\mathbf{x}_{t-1}|\mathbf{x}_t)$, the optimal denoiser is just the conditional expectation of this distribution. Such a denoiser is known as the minimum mean squared error (MMSE) denoiser. The MMSE denoiser is not the "best" denoiser; it is only the optimal denoiser with respect to the mean squared error. Since the mean squared error is never a good metric for image quality, minimizing the MSE does not necessarily give us a better image. Nevertheless, people like the MMSE denoiser because it is easy to derive.

Incremental Denoising Steps. Once we understand that the MMSE denoiser equals the conditional expectation of the posterior distribution, we can understand incremental denoising. Here is how it works. Suppose that we have a clean image $\mathbf{x}_0$ and a noisy image $\mathbf{y}$. Our goal is to form a linear combination of $\mathbf{x}_0$ and $\mathbf{y}$ via the simple equation

$$
\mathbf{x}_t = (1-t)\,\mathbf{x}_0 + t\,\mathbf{y}, \qquad 0 \le t \le 1. \qquad (62)
$$

Now, consider a small step $\tau$ previous to time $t$. The following result, shown by [6], provides some useful utilities: Let $0 \le \tau < t \le 1$, and suppose that $\mathbf{x}_t = (1-t)\mathbf{x}_0 + t\mathbf{y}$. Then it holds that

$$
\mathbb{E}[\mathbf{x}_{t-\tau}|\mathbf{x}_t] = \Big(1-\frac{\tau}{t}\Big)\underbrace{\mathbf{x}_t}_{\text{current estimate}} \;+\; \frac{\tau}{t}\,\underbrace{\mathbb{E}[\mathbf{x}_0|\mathbf{x}_t]}_{\text{denoised}}. \qquad (63)
$$

If we define $\widehat{\mathbf{x}}_{t-\tau}$ as the left-hand side, replace $\mathbf{x}_t$ by $\widehat{\mathbf{x}}_t$, and write $\mathbb{E}[\mathbf{x}_0|\mathbf{x}_t]$ as $\text{denoise}(\widehat{\mathbf{x}}_t)$, then the above equation becomes

$$
\widehat{\mathbf{x}}_{t-\tau} = \Big(1-\frac{\tau}{t}\Big)\cdot\widehat{\mathbf{x}}_t + \frac{\tau}{t}\,\text{denoise}(\widehat{\mathbf{x}}_t), \qquad (64)
$$

where $\tau$ is a small step in time.

Eqn (64) gives us the inference step. Assuming that we are given the denoiser and that we start with a noisy image $\mathbf{y}$, we can iteratively apply Eqn (64) to retrieve the images $\widehat{\mathbf{x}}_{t-1}$, $\widehat{\mathbf{x}}_{t-2}$, $\ldots$, $\widehat{\mathbf{x}}_0$. A minimal code sketch of this loop follows.
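Below is a minimal sketch of the incremental denoising loop in Eqn (64). The network name `denoise_net`, its `(x, t)` signature, and the uniform step size are illustrative assumptions.

```python
import torch

@torch.no_grad()
def indi_restore(denoise_net, y, n_steps=100):
    """Incremental denoising, Eqn (64): x_{t-tau} = (1 - tau/t) x_t + (tau/t) denoise(x_t)."""
    tau = 1.0 / n_steps
    x = y.clone()                      # start at t = 1 with the noisy image y
    t = 1.0
    for _ in range(n_steps):
        x = (1 - tau / t) * x + (tau / t) * denoise_net(x, t)
        t -= tau                       # step backward toward t = 0
    return x
```

Note that at the very last step $t = \tau$, the weight $\tau/t$ equals 1, so the iterate collapses onto the denoised estimate, as Eqn (64) suggests.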

Training. Training of the iterative scheme requires a denoiser that generates $\text{denoise}(\mathbf{x}_t)$. To this end, we can train a neural network $\text{denoise}_{\boldsymbol{\theta}}$ (where $\boldsymbol{\theta}$ denotes the network weights):

$$
\mathop{\mathrm{minimize}}_{\boldsymbol{\theta}}\;\mathbb{E}_{\mathbf{x},\mathbf{y}}\,\mathbb{E}_{t\sim\text{uniform}}\Big[\|\text{denoise}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}\|^2\Big]. \qquad (65)
$$

Here, the distribution "$t\sim\text{uniform}$" specifies that the time step $t$ is drawn uniformly from a given distribution. Hence, we are training one denoiser for all time steps $t$. The expectation over $(\mathbf{x},\mathbf{y})$ is typically realized using pairs of noisy and clean images in the training dataset. After training, we can perform the incremental update via Eqn (64).
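As a concrete illustration of Eqn (65), here is a minimal sketch of one training step, assuming paired clean/noisy tensors and the same hypothetical `denoise_net(x_t, t)` as in the inference sketch:

```python
import torch

def indi_training_step(denoise_net, x_clean, y_noisy, optimizer):
    """One step on Eqn (65): regress denoise_theta(x_t) onto the clean x for a random t."""
    b = x_clean.shape[0]
    t = torch.rand(b).view(b, 1, 1, 1)           # t ~ uniform on [0, 1)
    x_t = (1 - t) * x_clean + t * y_noisy        # interpolation, Eqn (62)
    loss = ((denoise_net(x_t, t) - x_clean) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```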

Connection with Denoising Score-Matching. Although we have not yet discussed score-matching (which will be presented in the next section), an interesting fact about the above iterative denoising procedure is that it is related to denoising score-matching. At a high level, we can rewrite the iteration as

$$
\begin{aligned}
\mathbf{x}_{t-\tau} &= \Big(1-\frac{\tau}{t}\Big)\cdot\mathbf{x}_t + \frac{\tau}{t}\,\text{denoise}(\mathbf{x}_t)\\
\Rightarrow\quad \mathbf{x}_{t-\tau} - \mathbf{x}_t &= -\frac{\tau}{t}\mathbf{x}_t + \frac{\tau}{t}\,\text{denoise}(\mathbf{x}_t)\\
\Rightarrow\quad \frac{\mathbf{x}_t - \mathbf{x}_{t-\tau}}{\tau} &= \frac{\mathbf{x}_t - \text{denoise}(\mathbf{x}_t)}{t}\\
\Rightarrow\quad \frac{d\mathbf{x}_t}{dt} = \lim_{\tau\to 0}\frac{\mathbf{x}_t - \mathbf{x}_{t-\tau}}{\tau} &= \frac{\mathbf{x}_t - \text{denoise}(\mathbf{x}_t)}{t}.
\end{aligned}
$$

This is an ordinary differential equation (ODE). If we let $\mathbf{x}_t = \mathbf{x} + t\boldsymbol{\epsilon}$ so that the noise level of $\mathbf{x}_t$ is $\sigma_t^2 = t^2\sigma^2$, then we can use several results in the literature to show that

$$
\begin{aligned}
\frac{d\mathbf{x}_t}{dt} &= -\frac{1}{2}\frac{d(\sigma_t^2)}{dt}\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) && (\text{ODE defined by Song et al. [8]})\\
&= -t\sigma^2\,\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t) && (\sigma_t = t\sigma)\\
&\approx -t\sigma^2\,\frac{\text{denoise}(\mathbf{x}_t)-\mathbf{x}_t}{t^2\sigma^2} && (\text{approximation proposed by Vincent [9]})\\
&= \frac{\mathbf{x}_t - \text{denoise}(\mathbf{x}_t)}{t}.
\end{aligned}
$$

Therefore, the incremental denoising iteration is equivalent to denoising score-matching, at least in the limiting case determined by the ODE.

Adding Stochastic Steps. The above incremental denoising iteration can incorporate stochastic perturbations. For the inference step, we can define a sequence of noise levels $\{\sigma_t\,|\,0\le t\le 1\}$ and define

$$
\widehat{\mathbf{x}}_{t-\tau} = \Big(1-\frac{\tau}{t}\Big)\cdot\widehat{\mathbf{x}}_t + \frac{\tau}{t}\,\text{denoise}(\widehat{\mathbf{x}}_t) + (t-\tau)\sqrt{\sigma_{t-\tau}^2 - \sigma_t^2}\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon}\sim\mathcal{N}(0,\mathbf{I}). \qquad (66)
$$
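As a sketch, Eqn (66) changes the deterministic loop in `indi_restore` by one line. The noise schedule `sigma_of` is a hypothetical user-supplied function, and the guard against a negative value under the square root is our own defensive assumption, not part of the original result.

```python
import torch

@torch.no_grad()
def indi_restore_stochastic(denoise_net, y, sigma_of, n_steps=100):
    """Eqn (66): incremental denoising with a stochastic perturbation."""
    tau = 1.0 / n_steps
    x, t = y.clone(), 1.0
    for _ in range(n_steps):
        drift = (1 - tau / t) * x + (tau / t) * denoise_net(x, t)
        # sigma_of(t) returns the noise level sigma_t at time t
        diff = max(sigma_of(t - tau) ** 2 - sigma_of(t) ** 2, 0.0) ** 0.5
        x = drift + (t - tau) * diff * torch.randn_like(x)
        t -= tau
    return x
```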

As for training, we can train the denoiser via

$$
\mathop{\mathrm{minimize}}_{\boldsymbol{\theta}}\;\mathbb{E}_{(\mathbf{x},\mathbf{y})}\,\mathbb{E}_{t\sim\text{uniform}}\,\mathbb{E}_{\boldsymbol{\epsilon}}\Big[\|\text{denoise}_{\boldsymbol{\theta}}(\mathbf{x}_t)-\mathbf{x}\|^2\Big], \qquad (67)
$$

where $\mathbf{x}_t = (1-t)\,\mathbf{x} + t\,\mathbf{y} + \sqrt{t}\,\sigma_t\,\boldsymbol{\epsilon}$.

Congratulations! We are done. This is all about DDPM.

The literature on DDPM is growing rapidly. The original papers by Sohl-Dickstein et al. [10] and Ho et al. [4] are must-reads to understand the topic. For a more "user-friendly" version, we found Luo's tutorial [11] very useful. Some follow-up works are heavily cited, including the denoising diffusion implicit model by Song et al. [12]. On the application side, people have used DDPM for various image synthesis applications, e.g., [13, 14].

    3 Score-Matching Langevin Dynamics (SMLD)

Score-based generative models [8] are an alternative approach to generating data from a desired distribution. There are several core ingredients: the Langevin dynamics, the (Stein) score function, and the score-matching loss. In this section, we look at these topics one by one.

    3.1 Langevin Dynamics

An interesting starting point of our discussion is the Langevin dynamics. It is a heavily physics-oriented topic that may appear to have nothing to do with generative models. But don't worry; in fact, they are related in a good way.

Instead of telling you the physics right away, let's talk about how Langevin dynamics can be used to draw samples from a distribution. Imagine that we are given a distribution $p(\mathbf{x})$ and suppose that we want to draw samples from $p(\mathbf{x})$. Langevin dynamics is an iterative procedure that allows us to draw samples according to the following equation.

The Langevin dynamics for sampling from a known distribution $p(\mathbf{x})$ is an iterative procedure for $t = 1, \ldots, T$:
$$
\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\nabla_{\mathbf{x}}\log p(\mathbf{x}_t) + \sqrt{2\tau}\,\mathbf{z}, \qquad \mathbf{z}\sim\mathcal{N}(0,\mathbf{I}), \qquad (68)
$$
where $\tau$ is the step size which users can control, and $\mathbf{x}_0$ is white noise.

You may wonder: what on earth is this mysterious equation about? Here is a short and quick answer. If we ignore the noise term $\sqrt{2\tau}\,\mathbf{z}$ at the end, the Langevin dynamics equation in Eqn (68) is literally gradient descent. The descent direction $\nabla_{\mathbf{x}}\log p(\mathbf{x})$ is carefully chosen such that $\mathbf{x}_t$ will converge to the distribution $p(\mathbf{x})$. If you watch a YouTube video that mumbles about Langevin dynamics for 10 minutes without explaining the equation, you can gently say: without the noise term, Langevin dynamics is gradient descent.

Consider a distribution $p(\mathbf{x})$. The shape of this distribution is fixed as soon as its form and model parameters are defined. For example, if we pick a Gaussian, the shape and location of the Gaussian are fixed once we specify the mean and the variance. The value $p(\mathbf{x})$ is simply the probability density evaluated at a data point $\mathbf{x}$. Therefore, if we move from one $\mathbf{x}$ to another $\mathbf{x}'$, we move from one value $p(\mathbf{x})$ to a different value $p(\mathbf{x}')$. The underlying shape of the Gaussian does not change.

Suppose that we start at an arbitrary location in $\mathbb{R}^d$. We want to move it toward the peak(s) of the distribution. The peak is a special place because it is where the probability is the highest. Therefore, if we say that a sample $\mathbf{x}$ is drawn from the distribution $p(\mathbf{x})$, certainly the "optimal" location for $\mathbf{x}$ is where $p(\mathbf{x})$ is maximized. If $p(\mathbf{x})$ has multiple local maxima, any one of them would be fine. So, naturally, the goal of sampling is equivalent to solving the optimization

$$\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}} \; \log p(\mathbf{x}).$$

We emphasize again that this is not maximum likelihood estimation. In maximum likelihood, the data point $\mathbf{x}$ is fixed but the model parameters are changing. Here, the model parameters are fixed but the data point is changing. The table below summarizes the differences.

Problem              | Sampling                                                                                        | Maximum Likelihood
Optimization target  | A sample $\mathbf{x}$                                                                           | Model parameter $\boldsymbol{\theta}$
Formulation          | $\mathbf{x}^* = \operatorname*{argmax}_{\mathbf{x}}\; \log p(\mathbf{x};\boldsymbol{\theta})$   | $\boldsymbol{\theta}^* = \operatorname*{argmax}_{\boldsymbol{\theta}}\; \log p(\mathbf{x};\boldsymbol{\theta})$

The optimization can be solved in many ways. The cheapest approach is, of course, gradient descent. For $\log p(\mathbf{x})$, the gradient descent step is

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\nabla_{\mathbf{x}}\log p(\mathbf{x}_t),$$

where $\nabla_{\mathbf{x}}\log p(\mathbf{x}_t)$ denotes the gradient of $\log p(\mathbf{x})$ evaluated at $\mathbf{x}_t$, and $\tau$ is the step size. Here we use "$+$" instead of the typical "$-$" because we are solving a maximization problem.

Example. Consider a Gaussian distribution $p(x) = \mathcal{N}(x\,|\,\mu,\sigma^2)$. We can easily show that the Langevin dynamics equation is

$$x_{t+1} = x_t + \tau\cdot\nabla_x \log\left\{\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}\right\} + \sqrt{2\tau}\,z = x_t - \tau\cdot\frac{x_t-\mu}{\sigma^2} + \sqrt{2\tau}\,z, \qquad z\sim\mathcal{N}(0,1).$$
Example. Consider a Gaussian mixture $p(x) = \pi_1\mathcal{N}(x\,|\,\mu_1,\sigma_1^2) + \pi_2\mathcal{N}(x\,|\,\mu_2,\sigma_2^2)$. We can numerically calculate $\nabla_x \log p(x)$. For demonstration, we choose $\pi_1 = 0.6$, $\mu_1 = 2$, $\sigma_1 = 0.5$, $\pi_2 = 0.4$, $\mu_2 = -2$, $\sigma_2 = 0.2$. We initialize $x_0 = 0$ and choose $\tau = 0.05$. We run the gradient descent iteration $\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\nabla_{\mathbf{x}}\log p(\mathbf{x}_t)$ for $T = 500$ steps, and we plot the trajectory of the values $p(x_t)$ for $t = 1,\ldots,T$. As we can see in the figure below, the sequence $\{x_1, x_2, \ldots, x_T\}$ simply follows the shape of the Gaussian and climbs to one of the peaks. What is more interesting is when we add the noise term, i.e., $\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\nabla_{\mathbf{x}}\log p(\mathbf{x}_t) + \sqrt{2\tau}\,\mathbf{z}$. Instead of landing at the peak, the sequence $x_t$ moves around the peak and finishes somewhere near it. The closer we are to the peak, the higher the probability that we will stop there.

[Figure: left, trajectory of gradient descent without the noise term; right, trajectory of Langevin dynamics with the noise term.]

Figure 17 shows an interesting description of the sample trajectory. Starting from an arbitrary location, the data point $\mathbf{x}_t$ will do a random walk according to the Langevin dynamics equation. The direction of the random walk is not completely arbitrary: there is a predefined drift with some level of randomness at every step. The drift is determined by $\nabla_{\mathbf{x}}\log p(\mathbf{x})$, whereas the randomness comes from $\mathbf{z}$.

Figure 17: Trajectory of sample evolutions using Langevin dynamics. We color the two modes of the Gaussian mixture differently for better visualization. The setting here is identical to the example above, except that the step size is $\tau = 0.001$.

As we can see in the examples above, adding the noise term effectively changes gradient descent to stochastic gradient descent. Instead of shooting for the deterministic optimum, stochastic gradient descent climbs the hill randomly. Because we use a constant step size $\sqrt{2\tau}$, the final solution will just oscillate around the peak. We can therefore summarize Langevin dynamics in one sentence: Langevin dynamics is stochastic gradient descent. But why would we want stochastic gradient descent instead of gradient descent? The key is that we are not interested in solving the optimization problem; instead, we are interested in sampling from the distribution. By introducing random noise into the gradient descent step, we randomly pick a sample that follows the trajectory of the objective function without sticking to the optimum. If we are close to the peak, we move slightly left and right. If we are far from the peak, the gradient direction pulls us toward the peak. If the curvature around the peak is sharp, most of the steady-state points $\mathbf{x}_T$ will concentrate there; if the curvature is flat, they will spread around. Therefore, by repeatedly initializing the stochastic gradient descent algorithm at uniformly distributed locations, we will eventually collect samples that follow the distribution we specified.

Example. Consider a Gaussian mixture $p(x) = \pi_1\mathcal{N}(x\,|\,\mu_1,\sigma_1^2) + \pi_2\mathcal{N}(x\,|\,\mu_2,\sigma_2^2)$. We can numerically calculate $\nabla_x \log p(x)$. For demonstration, we choose $\pi_1 = 0.6$, $\mu_1 = 2$, $\sigma_1 = 0.5$, $\pi_2 = 0.4$, $\mu_2 = -2$, $\sigma_2 = 0.2$. Suppose that we initialize $M = 10000$ uniformly distributed samples $x_0 \sim \text{Uniform}[-3,3]$. We run the Langevin updates for $t = 100$ steps. The histograms of the generated samples are shown in the figures below.

[Figure: histograms of the generated samples at several iteration counts.]
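A NumPy sketch of how this demonstration might be reproduced is given below. Following the example, the score is computed numerically (by central differences of $\log p$); the helper names are our own.

```python
import numpy as np

def mixture_pdf(x):
    # p(x) = 0.6 N(x | 2, 0.5^2) + 0.4 N(x | -2, 0.2^2)
    g1 = np.exp(-(x - 2.0) ** 2 / (2 * 0.5 ** 2)) / np.sqrt(2 * np.pi * 0.5 ** 2)
    g2 = np.exp(-(x + 2.0) ** 2 / (2 * 0.2 ** 2)) / np.sqrt(2 * np.pi * 0.2 ** 2)
    return 0.6 * g1 + 0.4 * g2

def score(x, eps=1e-4):
    # numerical grad_x log p(x) by central differences
    return (np.log(mixture_pdf(x + eps)) - np.log(mixture_pdf(x - eps))) / (2 * eps)

rng = np.random.default_rng(0)
M, T, tau = 10000, 100, 0.05
x = rng.uniform(-3, 3, size=M)                        # x_0 ~ Uniform[-3, 3]
for _ in range(T):
    z = rng.standard_normal(M)
    x = x + tau * score(x) + np.sqrt(2 * tau) * z     # Langevin update, Eqn (68)

hist, edges = np.histogram(x, bins=50, range=(-4, 4), density=True)
# hist should resemble p(x): about 60% of the mass near +2 and 40% near -2.
```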
Remark: Origin of Langevin Dynamics. The name Langevin dynamics of course does not originate from our "hacking" point of view; it starts with physics. Consider the basic Newton equation, which relates the force $\mathbf{F}$ to the mass $m$ and velocity $\mathbf{v}(t)$. Newton's second law says that

$$\underbrace{\mathbf{F}}_{\text{force}} = \underbrace{m}_{\text{mass}}\cdot\underbrace{\frac{d\mathbf{v}(t)}{dt}}_{\text{acceleration}}. \tag{69}$$

Given the force $\mathbf{F}$, we also know that it is related to the potential energy $U(\mathbf{x})$ via

$$\underbrace{\mathbf{F}}_{\text{force}} = \nabla_{\mathbf{x}}\underbrace{U(\mathbf{x})}_{\text{energy}}. \tag{70}$$

The randomness of Langevin dynamics comes from Brownian motion. Imagine that we have a bag of molecules moving around. Their motion can be described according to the Brownian motion model:

$$\frac{d\mathbf{v}(t)}{dt} = -\frac{\lambda}{m}\mathbf{v}(t) + \frac{1}{m}\boldsymbol{\eta}, \qquad \text{where }\;\boldsymbol{\eta}\sim\mathcal{N}(0,\sigma^2\mathbf{I}). \tag{71}$$

Therefore, substituting Eqn (71) into Eqn (69), and equating it with Eqn (70), we have

$$\nabla_{\mathbf{x}}U(\mathbf{x}) = -\lambda\mathbf{v}(t) + \boldsymbol{\eta} \quad\Rightarrow\quad \mathbf{v}(t) = -\frac{1}{\lambda}\nabla_{\mathbf{x}}U(\mathbf{x}) + \frac{1}{\lambda}\boldsymbol{\eta}.$$

Since $\mathbf{v}(t) = \frac{d\mathbf{x}}{dt}$, this can be equivalently written as

$$\frac{d\mathbf{x}}{dt} = -\frac{1}{\lambda}\nabla_{\mathbf{x}}U(\mathbf{x}) + \frac{\sigma}{\lambda}\mathbf{z}, \qquad \text{where }\;\mathbf{z}\sim\mathcal{N}(0,\mathbf{I}). \tag{72}$$

If we let $\tau = \frac{dt}{\lambda}$ and discretize the above differential equation, we will obtain

$$\mathbf{x}_{t+1} = \mathbf{x}_t - \tau\nabla_{\mathbf{x}}U(\mathbf{x}_t) + \sigma\tau\,\mathbf{z}_t. \tag{73}$$

So it remains to identify the energy potential. A very reasonable (and lazy) choice for our probability distribution function $p(\mathbf{x})$ is the Boltzmann distribution, which takes the form

$$p(\mathbf{x}) = \frac{1}{Z}\exp\{-U(\mathbf{x})\}.$$

Therefore, it follows immediately that

$$\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \nabla_{\mathbf{x}}\big\{-U(\mathbf{x}) - \log Z\big\} = -\nabla_{\mathbf{x}}U(\mathbf{x}). \tag{74}$$

Substituting Eqn (74) into Eqn (73) yields $\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\nabla_{\mathbf{x}}\log p(\mathbf{x}_t) + \sigma\tau\,\mathbf{z}_t$. Finally, if we choose $\sigma = \sqrt{2/\tau}$ (for no particular reason), we will obtain

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\nabla_{\mathbf{x}}\log p(\mathbf{x}_t) + \sqrt{2\tau}\,\mathbf{z}_t. \tag{75}$$

    3.2 (Stein’s) Score Function

The second component of the Langevin dynamics equation is the gradient $\nabla_{\mathbf{x}}\log p(\mathbf{x})$. It has a formal name known as Stein's score function, denoted by

$$\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) \overset{\text{def}}{=} \nabla_{\mathbf{x}}\log p_{\boldsymbol{\theta}}(\mathbf{x}). \tag{76}$$

We should be careful not to confuse Stein's score function with the ordinary score function, which is defined as

$$\mathbf{s}_{\mathbf{x}}(\boldsymbol{\theta}) \overset{\text{def}}{=} \nabla_{\boldsymbol{\theta}}\log p_{\boldsymbol{\theta}}(\mathbf{x}). \tag{77}$$

The ordinary score function is the gradient (with respect to $\boldsymbol{\theta}$) of the log-likelihood. In contrast, Stein's score function is the gradient with respect to the data point $\mathbf{x}$. Maximum likelihood estimation typically uses the ordinary score function, whereas Langevin dynamics uses Stein's score function. However, since most people in the diffusion literature call Stein's score function simply the score function, we follow this culture: the "score function" in Langevin dynamics is more accurately known as Stein's score function.

The way to understand the score function is to remember that it is the gradient with respect to the data $\mathbf{x}$. For any high-dimensional distribution $p(\mathbf{x})$, the gradient gives us a vector field

$$\nabla_{\mathbf{x}}\log p(\mathbf{x}) = \text{a vector field} = \left[\frac{\partial\log p(\mathbf{x})}{\partial x},\;\frac{\partial\log p(\mathbf{x})}{\partial y}\right]^T \tag{78}$$

Let's consider two examples.

Example. If $p(x)$ is a Gaussian with $p(x) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, then

$$s(x) = \nabla_x\log p(x) = -\frac{x-\mu}{\sigma^2}.$$

Example. If $p(x)$ is a Gaussian mixture with $p(x) = \sum_{i=1}^N \pi_i \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}$, then

$$s(x) = \nabla_x\log p(x) = -\frac{\sum_{j=1}^N \pi_j \frac{1}{\sqrt{2\pi\sigma_j^2}} e^{-\frac{(x-\mu_j)^2}{2\sigma_j^2}}\,\frac{x-\mu_j}{\sigma_j^2}}{\sum_{i=1}^N \pi_i \frac{1}{\sqrt{2\pi\sigma_i^2}} e^{-\frac{(x-\mu_i)^2}{2\sigma_i^2}}}.$$
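As a sanity check, the closed-form mixture score above can be compared against a finite-difference gradient of $\log p$. Here is a short NumPy sketch (variable names are ours), using the same two-component mixture as in the earlier demonstrations:

```python
import numpy as np

pi  = np.array([0.6, 0.4])
mu  = np.array([2.0, -2.0])
sig = np.array([0.5, 0.2])

def gauss(x, m, s):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

def mixture_score(x):
    # s(x) = -[sum_j pi_j N(x|mu_j,sig_j^2)(x-mu_j)/sig_j^2] / [sum_i pi_i N(x|mu_i,sig_i^2)]
    num = sum(p * gauss(x, m, s) * (x - m) / s ** 2 for p, m, s in zip(pi, mu, sig))
    den = sum(p * gauss(x, m, s) for p, m, s in zip(pi, mu, sig))
    return -num / den

logp = lambda t: np.log(sum(p * gauss(t, m, s) for p, m, s in zip(pi, mu, sig)))
x, eps = 1.3, 1e-5
print(mixture_score(x), (logp(x + eps) - logp(x - eps)) / (2 * eps))  # the two should agree
```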

The probability density functions of the two examples above and their corresponding score functions are shown in Figure 18.

Figure 18: Examples of score functions. (a) $\mathcal{N}(1,1)$; (b) $0.6\,\mathcal{N}(2, 0.5^2) + 0.4\,\mathcal{N}(-2, 0.2^2)$.

    Geometric Interpretations of the Score Function.

• The magnitude of the vectors is the strongest where the change of $\log p(\mathbf{x})$ is the biggest. Therefore, in regions where $\log p(\mathbf{x})$ is close to a peak, the gradient will mostly be very weak.

• The vector field indicates how a data point should travel in the contour map. In Figure 19 we show the contour map of a Gaussian mixture (with two Gaussians). We draw arrows to indicate the vector field. Now, if we consider a data point living in this space, the Langevin dynamics equation will basically move the data point along the directions pointed by the vector field toward the basin.

• In physics, the score function is analogous to "drift". The name suggests how the diffusion particles should flow to the lowest-energy state.

Figure 19: (a) The contour map and vector field of $\nabla_{\mathbf{x}}\log p(\mathbf{x})$; (b) the corresponding trajectories $\mathbf{x}_t$ of two samples.

    3.3 Score Matching Techniques

The most difficult question in Langevin dynamics is how to obtain $\nabla_{\mathbf{x}}\log p(\mathbf{x})$, because we have no access to $p(\mathbf{x})$. Let us recall the definition of Stein's score function:

$$\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) \overset{\text{def}}{=} \nabla_{\mathbf{x}}\log p(\mathbf{x}), \tag{79}$$

where we put the subscript $\boldsymbol{\theta}$ to indicate that $\mathbf{s}_{\boldsymbol{\theta}}$ will be implemented via a network. Since the right-hand side of the equation above is unknown, we need some cheap and dirty ways to approximate it. In this section, we briefly discuss two such approximations.

Explicit Score-Matching. Suppose that we are given a dataset $\mathcal{X} = \{\mathbf{x}_1,\ldots,\mathbf{x}_M\}$. A solution people came up with is to consider the classical kernel density estimation, which defines a distribution

$$q(\mathbf{x}) = \frac{1}{M}\sum_{m=1}^{M}\frac{1}{h}K\!\left(\frac{\mathbf{x}-\mathbf{x}_m}{h}\right), \tag{80}$$

where $h$ is the bandwidth of the kernel function $K(\cdot)$, and $\mathbf{x}_m$ is simply the $m$-th sample in the training set. Figure 20 illustrates the idea of kernel density estimation. In the cartoon figure shown on the left, we show multiple kernels $K(\cdot)$ centered at different data points $\mathbf{x}_m$. The sum of all these individual kernels gives the overall kernel density estimate $q(\mathbf{x})$. On the right we show a real histogram and the corresponding kernel density estimate. We remark that $q(\mathbf{x})$ is at best an approximation to the true, unknown data distribution $p(\mathbf{x})$.

Figure 20: Illustration of kernel density estimation.
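Eqn (80) translates directly into code. Below is a small NumPy sketch (function names are ours) of the kernel density estimate with a Gaussian kernel $K$, together with a numerical $\nabla_{\mathbf{x}}\log q(\mathbf{x})$, which is exactly the regression target used by the explicit score matching loss introduced next.

```python
import numpy as np

def kde(x, data, h=0.3):
    """Kernel density estimate q(x) of Eqn (80) with a Gaussian kernel K."""
    u = (x - data[:, None]) / h                       # shape (M, len(x))
    K = np.exp(-u ** 2 / 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=0) / h                         # (1/M) sum_m (1/h) K((x - x_m)/h)

def kde_score(x, data, h=0.3, eps=1e-4):
    """Numerical grad_x log q(x): the regression target for explicit score matching."""
    return (np.log(kde(x + eps, data, h)) - np.log(kde(x - eps, data, h))) / (2 * eps)

rng = np.random.default_rng(1)
data = rng.normal(2.0, 0.5, size=500)                 # toy training set X
xs = np.linspace(0.5, 3.5, 7)
print(kde_score(xs, data))                            # roughly -(x - 2)/0.5^2 on this toy set
```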

Since $q(\mathbf{x})$ is an approximation of the never-accessible $p(\mathbf{x})$, we can learn $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})$ based on $q(\mathbf{x})$. This leads to the following definition of a loss function which can be used to train a network.

The explicit score matching loss is

$$\begin{aligned} J_{\text{ESM}}(\boldsymbol{\theta}) &\overset{\text{def}}{=} \mathbb{E}_{q(\mathbf{x})}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x})\|^2 \\ &= \int \|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x})\|^2 \left[\frac{1}{M}\sum_{m=1}^{M}\frac{1}{h}K\!\left(\frac{\mathbf{x}-\mathbf{x}_m}{h}\right)\right] d\mathbf{x} \\ &= \frac{1}{M}\sum_{m=1}^{M}\int \|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x})\|^2\,\frac{1}{h}K\!\left(\frac{\mathbf{x}-\mathbf{x}_m}{h}\right) d\mathbf{x}. \end{aligned} \tag{82}$$

So we have derived a loss function that can be used to train the network. Once we have trained the network $\mathbf{s}_{\boldsymbol{\theta}}$, we can substitute it into the Langevin dynamics equation to obtain the recursion:

$$\mathbf{x}_{t+1} = \mathbf{x}_t + \tau\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_t) + \sqrt{2\tau}\,\mathbf{z}. \tag{83}$$

The problem with explicit score matching is that kernel density estimation is a fairly poor non-parametric estimate of the true distribution. Especially when the number of samples is limited and the samples live in a high-dimensional space, kernel density estimation can perform poorly.

Denoising Score Matching. Given the potential drawbacks of explicit score matching, we now introduce a more popular form of score matching known as denoising score matching (DSM). In DSM, the loss function is defined as follows:

$$J_{\text{DSM}}(\boldsymbol{\theta}) \overset{\text{def}}{=} \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\left[\frac{1}{2}\left\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\right\|^2\right] \tag{84}$$

The key difference here is that we replace the distribution $q(\mathbf{x})$ with a conditional distribution $q(\mathbf{x}|\mathbf{x}')$. The former requires an approximation, e.g., via kernel density estimation, whereas the latter does not. Here is an example.

In the special case where $q(\mathbf{x}|\mathbf{x}') = \mathcal{N}(\mathbf{x}\,|\,\mathbf{x}',\sigma^2)$, we can let $\mathbf{x} = \mathbf{x}' + \sigma\mathbf{z}$ with $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$. This will give us

$$\begin{aligned} \nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}') &= \nabla_{\mathbf{x}}\log\frac{1}{(\sqrt{2\pi\sigma^2})^d}\exp\left\{-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2}\right\} \\ &= \nabla_{\mathbf{x}}\left\{-\frac{\|\mathbf{x}-\mathbf{x}'\|^2}{2\sigma^2} - \log(\sqrt{2\pi\sigma^2})^d\right\} \\ &= -\frac{\mathbf{x}-\mathbf{x}'}{\sigma^2} = -\frac{\mathbf{z}}{\sigma}. \end{aligned}$$

As a result, the loss function of denoising score matching becomes

$$\begin{aligned} J_{\text{DSM}}(\boldsymbol{\theta}) &\overset{\text{def}}{=} \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\left[\frac{1}{2}\left\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\right\|^2\right] \\ &= \mathbb{E}_{q(\mathbf{x}')}\left[\frac{1}{2}\left\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}'+\sigma\mathbf{z}) + \frac{\mathbf{z}}{\sigma}\right\|^2\right]. \end{aligned}$$

If we replace the dummy variable $\mathbf{x}'$ by $\mathbf{x}$, and note that sampling from $q(\mathbf{x})$ can be replaced by sampling from $p(\mathbf{x})$ when we are given a training dataset, we can conclude the following.

The denoising score matching loss function is

$$J_{\text{DSM}}(\boldsymbol{\theta}) = \mathbb{E}_{p(\mathbf{x})}\left[\frac{1}{2}\left\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}+\sigma\mathbf{z}) + \frac{\mathbf{z}}{\sigma}\right\|^2\right] \tag{85}$$

The beauty of Eqn (85) is that it is highly interpretable. The quantity $\mathbf{x} + \sigma\mathbf{z}$ is effectively adding noise $\sigma\mathbf{z}$ to a clean image $\mathbf{x}$. The score function $\mathbf{s}_{\boldsymbol{\theta}}$ is supposed to take this noisy image and predict the scaled noise $-\frac{\mathbf{z}}{\sigma}$. Predicting the noise is equivalent to denoising, because the denoised image plus the predicted noise gives back the noisy observation. Therefore, Eqn (85) is a denoising step. Figure 21 illustrates the training procedure of the score function $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})$.

Figure 21: Training of $\mathbf{s}_{\boldsymbol{\theta}}$ for denoising score matching. The network $\mathbf{s}_{\boldsymbol{\theta}}$ is trained to estimate the noise.

The training step can simply be described as follows:

$$\boldsymbol{\theta}^* = \operatorname*{argmin}_{\boldsymbol{\theta}}\;\frac{1}{L}\sum_{\ell=1}^{L}\frac{1}{2}\left\|\mathbf{s}_{\boldsymbol{\theta}}\left(\mathbf{x}^{(\ell)}+\sigma\mathbf{z}^{(\ell)}\right) + \frac{\mathbf{z}^{(\ell)}}{\sigma}\right\|^2, \qquad \text{where }\;\mathbf{z}^{(\ell)}\sim\mathcal{N}(0,\mathbf{I}). \tag{86}$$
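The objective in Eqn (86) is short enough to state in code. Below is a minimal PyTorch sketch, not the authors' implementation: the toy network architecture, the optimizer settings, the noise level $\sigma$, and the synthetic 2-D "clean" dataset are all our own illustrative assumptions.

```python
import torch
import torch.nn as nn

sigma = 0.1
score_net = nn.Sequential(                     # a toy score network s_theta: R^2 -> R^2
    nn.Linear(2, 64), nn.SiLU(),
    nn.Linear(64, 64), nn.SiLU(),
    nn.Linear(64, 2))
opt = torch.optim.Adam(score_net.parameters(), lr=1e-3)

x_train = 0.5 * torch.randn(4096, 2) + 2.0     # hypothetical "clean" training data

for step in range(2000):
    x = x_train[torch.randint(0, len(x_train), (128,))]   # minibatch of clean samples
    z = torch.randn_like(x)                               # z ~ N(0, I)
    # Eqn (86): (1/2) || s_theta(x + sigma z) + z / sigma ||^2, averaged over the batch
    loss = 0.5 * ((score_net(x + sigma * z) + z / sigma) ** 2).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```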

The bigger question here is why Eqn (84) makes sense in the first place. This needs to be answered via the equivalence between the explicit score matching loss and the denoising score matching loss.

Theorem [Vincent [9]]. Up to a constant $C$ which is independent of the variable $\boldsymbol{\theta}$, it holds that

$$J_{\text{DSM}}(\boldsymbol{\theta}) = J_{\text{ESM}}(\boldsymbol{\theta}) + C. \tag{87}$$

The equivalence between explicit score matching and denoising score matching is a major finding. The proof below is based on the original work of Vincent [9].

Proof of Eqn (87). We start with the explicit score matching loss function, which is given by

$$\begin{aligned} J_{\text{ESM}}(\boldsymbol{\theta}) &= \mathbb{E}_{q(\mathbf{x})}\left[\frac{1}{2}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x})\|^2\right] \\ &= \mathbb{E}_{q(\mathbf{x})}\Big[\frac{1}{2}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})\|^2 - \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x}) + \underbrace{\tfrac{1}{2}\|\nabla_{\mathbf{x}}\log q(\mathbf{x})\|^2}_{\overset{\text{def}}{=}C_1,\ \text{independent of }\boldsymbol{\theta}}\Big]. \end{aligned}$$

Let's zoom into the second term. We can show that

$$\begin{aligned} \mathbb{E}_{q(\mathbf{x})}\left[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x})\right] &= \int \left(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x})\right) q(\mathbf{x})\,d\mathbf{x} && \text{(expectation)} \\ &= \int \left(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\frac{\nabla_{\mathbf{x}}q(\mathbf{x})}{q(\mathbf{x})}\right) q(\mathbf{x})\,d\mathbf{x} && \text{(gradient of log)} \\ &= \int \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}q(\mathbf{x})\,d\mathbf{x}. \end{aligned}$$

Next, we consider conditioning by recalling $q(\mathbf{x}) = \int q(\mathbf{x}')q(\mathbf{x}|\mathbf{x}')\,d\mathbf{x}'$. This will give us

$$\begin{aligned} \int \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}q(\mathbf{x})\,d\mathbf{x} &= \int \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\underbrace{\left(\int q(\mathbf{x}')q(\mathbf{x}|\mathbf{x}')\,d\mathbf{x}'\right)}_{=q(\mathbf{x})}d\mathbf{x} && \text{(conditioning)} \\ &= \int \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\left(\int q(\mathbf{x}')\nabla_{\mathbf{x}}q(\mathbf{x}|\mathbf{x}')\,d\mathbf{x}'\right)d\mathbf{x} && \text{(move the gradient)} \\ &= \int \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\left(\int q(\mathbf{x}')\nabla_{\mathbf{x}}q(\mathbf{x}|\mathbf{x}')\times\frac{q(\mathbf{x}|\mathbf{x}')}{q(\mathbf{x}|\mathbf{x}')}\,d\mathbf{x}'\right)d\mathbf{x} && \text{(multiply and divide)} \\ &= \int \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\int q(\mathbf{x}')\underbrace{\left(\frac{\nabla_{\mathbf{x}}q(\mathbf{x}|\mathbf{x}')}{q(\mathbf{x}|\mathbf{x}')}\right)}_{=\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')}q(\mathbf{x}|\mathbf{x}')\,d\mathbf{x}'\,d\mathbf{x} && \text{(rearrange terms)} \\ &= \int\!\!\int \underbrace{q(\mathbf{x}|\mathbf{x}')q(\mathbf{x}')}_{=q(\mathbf{x},\mathbf{x}')}\left(\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\right)d\mathbf{x}'\,d\mathbf{x} && \text{(move integration)} \\ &= \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\left[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\right]. \end{aligned}$$

So, if we substitute this result back into the definition of ESM, we can show that

$$J_{\text{ESM}}(\boldsymbol{\theta}) = \mathbb{E}_{q(\mathbf{x})}\Big[\frac{1}{2}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})\|^2\Big] - \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\left[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\right] + C_1.$$

Comparing this with the definition of DSM, we observe that

$$\begin{aligned} J_{\text{DSM}}(\boldsymbol{\theta}) &\overset{\text{def}}{=} \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\left[\frac{1}{2}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}) - \nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\|^2\right] \\ &= \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\Big[\frac{1}{2}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})\|^2 - \mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}') + \underbrace{\tfrac{1}{2}\|\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\|^2}_{\overset{\text{def}}{=}C_2,\ \text{independent of }\boldsymbol{\theta}}\Big] \\ &= \mathbb{E}_{q(\mathbf{x})}\Big[\frac{1}{2}\|\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})\|^2\Big] - \mathbb{E}_{q(\mathbf{x},\mathbf{x}')}\left[\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x})^T\nabla_{\mathbf{x}}\log q(\mathbf{x}|\mathbf{x}')\right] + C_2. \end{aligned}$$

Therefore, we conclude that $J_{\text{DSM}}(\boldsymbol{\theta}) = J_{\text{ESM}}(\boldsymbol{\theta}) - C_1 + C_2$.

For inference, we assume that the score estimator $\mathbf{s}_{\boldsymbol{\theta}}$ has already been trained. To generate an image, we perform the following procedure for $t=1,\ldots,T$:

$$\mathbf{x}_{t+1}=\mathbf{x}_{t}+\tau\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{t})+\sqrt{2\tau}\,\mathbf{z}_{t},\qquad\text{where}\quad\mathbf{z}_{t}\sim\mathcal{N}(0,\mathbf{I}).$$ (88)
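As a concrete illustration of Eqn (88), the following is a minimal sketch of the Langevin sampling loop. Since no trained network is available here, the exact score of a standard Gaussian, $-\mathbf{x}$, stands in for the learned $\mathbf{s}_{\boldsymbol{\theta}}$, so the iterates should settle into $\mathcal{N}(0,\mathbf{I})$.

```python
# A minimal sketch of the Langevin sampling iteration in Eqn (88).
# The learned score s_theta is replaced by the exact score of a
# standard Gaussian, s(x) = -x, for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def score(x):                      # stand-in for s_theta(x)
    return -x                      # exact score of N(0, I)

tau, T = 0.01, 1000                # step size and number of iterations
x = rng.normal(size=2) * 5.0       # arbitrary initialization
for _ in range(T):
    z = rng.normal(size=x.shape)
    x = x + tau * score(x) + np.sqrt(2 * tau) * z

print(x)                           # an approximate sample from N(0, I)
```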

Congratulations! We are done. This is everything about score-based generative models.

Further reading on score matching should start with Vincent's technical report [9]. Papers that are very popular in the recent literature are Song and Ermon [15], and their follow-up works [16], [8]. In practice, training the score function requires a noise schedule, obtained by considering a sequence of noise levels. We will briefly discuss this when we describe the variance-exploding SDE in the next section.

    4 Stochastic Differential Equation (SDE)

So far, we have derived the diffusion iterations through the DDPM and SMLD perspectives. In this section, we introduce a third perspective, through the lens of differential equations. It may not be obvious why our iterative schemes suddenly become fancy differential equations, so before deriving the equations we should briefly discuss how a differential equation can be relevant to us.

    4.1 Motivating Examples

Example 1. Simple First-Order ODE. Imagine that we are given a discrete-time algorithm with the iterations defined by the recursion
$$\mathbf{x}_{i}=\left(1-\frac{\beta\Delta t}{2}\right)\mathbf{x}_{i-1},\qquad\text{for}\;\;i=1,2,\ldots,N,$$ (89)
for some hyperparameter $\beta$ and a step-size parameter $\Delta t$. This recursion has nothing complicated: you give us $\mathbf{x}_{i-1}$, we update and return you $\mathbf{x}_{i}$. If we assume a discretization scheme of a continuous-time function $\mathbf{x}(t)$ by letting $\mathbf{x}_{i}=\mathbf{x}(\tfrac{i}{N})$, $\Delta t=\tfrac{1}{N}$, and $t\in\{0,\tfrac{1}{N},\ldots,\tfrac{N-1}{N}\}$, then we can rewrite the recursion as
$$\mathbf{x}(t+\Delta t)=\left(1-\frac{\beta\Delta t}{2}\right)\mathbf{x}(t).$$
Rearranging the terms gives us
$$\frac{\mathbf{x}(t+\Delta t)-\mathbf{x}(t)}{\Delta t}=-\frac{\beta}{2}\mathbf{x}(t),$$
where at the limit $\Delta t\rightarrow 0$ we can write the discrete equation as an ordinary differential equation (ODE)
$$\frac{d\mathbf{x}(t)}{dt}=-\frac{\beta}{2}\mathbf{x}(t).$$ (90)
Not only that, we can solve the ODE analytically, where the solution is given by
$$\mathbf{x}(t)=e^{-\frac{\beta}{2}t}.$$ (91)
If you don't believe us, just substitute Eqn (91) into Eqn (90) and verify that the equality holds. The power of the ODE is that it offers us an analytic solution. Instead of resorting to the iterative scheme (which would take hundreds to thousands of iterations), the analytic solution tells us exactly the behavior of the solution at any time $t$.
To illustrate this fact, the figure below shows the trajectory of the iterates $\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{i},\ldots,\mathbf{x}_{N}$ defined by the algorithm, with $\Delta t=0.1$. In the same plot, we directly plot the continuous-time solution $\mathbf{x}(t)=\exp\{-\beta t/2\}$ for arbitrary $t$. As you can see, the analytic solution matches the trajectory predicted by the iterative scheme. [Figure: discrete iterates of Eqn (89) overlaid on the analytic solution $e^{-\beta t/2}$.]
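A quick numerical check of this comparison is sketched below (with an arbitrary choice of $\beta$, $\Delta t$, and $\mathbf{x}_{0}=1$ so that the analytic solution is exactly $e^{-\beta t/2}$): we run the recursion of Eqn (89) and measure the gap to Eqn (91).

```python
# Numerical check of Example 1: the discrete recursion in Eqn (89)
# against the analytic ODE solution x(t) = exp(-beta*t/2) in Eqn (91).
import numpy as np

beta, dt, N = 2.0, 0.1, 50
x = np.empty(N + 1)
x[0] = 1.0                                   # x(0) = 1 so that x(t) = exp(-beta*t/2)
for i in range(1, N + 1):
    x[i] = (1 - beta * dt / 2) * x[i - 1]    # Eqn (89)

t = np.arange(N + 1) * dt
analytic = np.exp(-beta * t / 2)             # Eqn (91)
print(np.max(np.abs(x - analytic)))          # the iterates closely track the analytic curve
```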

From this motivating example, we observe two interesting facts:

• A discrete-time iterative scheme can be written as a continuous-time ordinary differential equation. It turns out that, for finite-difference equations, we can turn the recursion into an ODE.

• For a simple ODE, we can write down the analytic solution in closed form. More complicated ODEs will be hard to solve analytically, but we can still use ODE tools to analyze the behavior of the solution. We can also derive the limiting solution as $t\rightarrow\infty$.

Example 2: Gradient Descent. Recall that a gradient descent algorithm for a (well-behaved) convex function $f$ is the following recursion. For $i=1,2,\ldots,N$, do
$$\mathbf{x}_{i}=\mathbf{x}_{i-1}-\beta_{i-1}\nabla f(\mathbf{x}_{i-1}),$$ (92)
for step-size parameter $\beta_{i}$. Using the same discretization as in the previous example (and letting $\beta_{i-1}=\beta(t)\Delta t$), we can show that
$$\begin{aligned}
\mathbf{x}_{i}=\mathbf{x}_{i-1}-\beta_{i-1}\nabla f(\mathbf{x}_{i-1})
\quad&\Longrightarrow\quad \mathbf{x}(t+\Delta t)=\mathbf{x}(t)-\beta(t)\Delta t\,\nabla f(\mathbf{x}(t))\\
&\Longrightarrow\quad \frac{\mathbf{x}(t+\Delta t)-\mathbf{x}(t)}{\Delta t}=-\beta(t)\nabla f(\mathbf{x}(t))\\
&\Longrightarrow\quad \frac{d\mathbf{x}(t)}{dt}=-\beta(t)\nabla f(\mathbf{x}(t)).
\end{aligned}$$ (93)
The ordinary differential equation on the last line has a solution trajectory $\mathbf{x}(t)$, known as the gradient flow of the function $f$. For simplicity, we can set $\beta(t)=\beta$ for all $t$. Then there are two simple facts about this ODE.
First, we can show that
$$\begin{aligned}
\frac{d}{dt}f(\mathbf{x}(t)) &= \nabla f(\mathbf{x}(t))^{T}\frac{d\mathbf{x}(t)}{dt} &&(\text{chain rule})\\
&= \nabla f(\mathbf{x}(t))^{T}\left[-\beta\nabla f(\mathbf{x}(t))\right] &&(\text{Eqn (93)})\\
&= -\beta\left\|\nabla f(\mathbf{x}(t))\right\|^{2}\;\leq\;0 &&(\text{norm-squares}).
\end{aligned}$$
Therefore, as we move from $\mathbf{x}_{i-1}$ to $\mathbf{x}_{i}$, the objective value $f(\mathbf{x}(t))$ has to go down. This is consistent with our expectation, because a gradient descent algorithm should bring the cost down as the iterations go on. Second, in the limit as $t\rightarrow\infty$, we know that $\frac{d\mathbf{x}(t)}{dt}\rightarrow 0$. Hence, $\frac{d\mathbf{x}(t)}{dt}=-\beta\nabla f(\mathbf{x}(t))$ implies that
$$\nabla f(\mathbf{x}(t))\rightarrow 0,\qquad\text{as }t\rightarrow\infty.$$ (94)
Therefore, the solution trajectory $\mathbf{x}(t)$ will approach the minimizer of the function $f$.
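The following sketch illustrates both facts on the simple convex function $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}\|^{2}$ (our own choice for illustration), whose gradient is $\mathbf{x}$: along the iterations of Eqn (92), the objective decreases and the gradient norm vanishes.

```python
# A sketch of the two gradient-flow facts on f(x) = ||x||^2 / 2,
# whose gradient is grad_f(x) = x. Assumes a constant step size beta.
import numpy as np

def f(x):
    return 0.5 * np.dot(x, x)

def grad_f(x):
    return x

beta, N = 0.1, 200
x = np.array([4.0, -3.0])                    # arbitrary starting point
for i in range(N):
    x = x - beta * grad_f(x)                 # Eqn (92)

print(f(x), np.linalg.norm(grad_f(x)))       # both close to zero, as predicted
```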

    Forward and Backward Updates.

We use the gradient descent example to illustrate one more aspect of the ODE. Returning to Eqn (92), we recognize that the recursion can equivalently be written as (assuming $\beta(t)=\beta$):

$$\underbrace{\mathbf{x}_{i}-\mathbf{x}_{i-1}}_{\Delta\mathbf{x}}=-\underbrace{\beta_{i-1}}_{\beta\Delta t}\nabla f(\mathbf{x}_{i-1})\;\;\Rightarrow\;\;d\mathbf{x}=-\beta\nabla f(\mathbf{x})\,dt,$$ (95)

where the continuous equation holds as we let $\Delta t\rightarrow 0$ and $\Delta\mathbf{x}\rightarrow 0$. The interesting point about this equality is that it summarizes the update $\Delta\mathbf{x}$ by writing it in terms of $dt$. It says that if we move along the time axis by $dt$, then the solution $\mathbf{x}$ will be updated by $d\mathbf{x}$.

Eqn (95) defines the relationship between the changes. If we consider the sequence of iterations $i=1,2,\ldots,N$ and let the progression of the iterations follow Eqn (95), then we can write

$$\begin{aligned}
\text{(forward)}\qquad \mathbf{x}_{i}=\mathbf{x}_{i-1}+\Delta\mathbf{x}_{i-1} &\approx \mathbf{x}_{i-1}+d\mathbf{x}\\
&= \mathbf{x}_{i-1}-\nabla f(\mathbf{x}_{i-1})\,\beta\,dt\\
&\approx \mathbf{x}_{i-1}-\beta_{i-1}\nabla f(\mathbf{x}_{i-1}).
\end{aligned}$$

We call this the forward equation because we update $\mathbf{x}$ by $\mathbf{x}+\Delta\mathbf{x}$, assuming that $t\leftarrow t+\Delta t$.

Now, consider a sequence of iterations $i=N,N-1,\ldots,2,1$. If we let the progression of the iterations follow Eqn (95), then the time-reversed iteration is

$$\begin{aligned}
\text{(reverse)}\qquad \mathbf{x}_{i-1}=\mathbf{x}_{i}-\Delta\mathbf{x}_{i} &\approx \mathbf{x}_{i}+d\mathbf{x}\\
&= \mathbf{x}_{i}+\beta\,\nabla f(\mathbf{x}_{i})\,dt\\
&\approx \mathbf{x}_{i}+\beta_{i}\nabla f(\mathbf{x}_{i}).
\end{aligned}$$

Note the sign change when we reverse the direction of the progression. We call this the reverse equation.

    4.2 Forward and Backward Iterations in SDE

The concept of a differential equation for diffusion is not too far from the gradient descent algorithm above. If we introduce a noise term $\mathbf{z}_{t}\sim\mathcal{N}(0,\mathbf{I})$ to the gradient descent algorithm, then the ODE becomes a stochastic differential equation (SDE). To see this, we follow the same discretization scheme by defining $\mathbf{x}(t)$ as a continuous function for $0\leq t\leq 1$. Suppose that there are $N$ steps in the interval, so that $[0,1]$ can be divided into the sequence $\{\tfrac{i}{N}\,|\,i=0,\ldots,N-1\}$. The discretization gives us $\mathbf{x}_{i}=\mathbf{x}(\tfrac{i}{N})$ and $\mathbf{x}_{i-1}=\mathbf{x}(\tfrac{i-1}{N})$. The interval step is $\Delta t=\tfrac{1}{N}$, and the set of all $t$'s is $t\in\{0,\tfrac{1}{N},\ldots,\tfrac{N-1}{N}\}$. Using these definitions, we can write

$$\mathbf{x}_{i}=\mathbf{x}_{i-1}-\tau\nabla f(\mathbf{x}_{i-1})+\mathbf{z}_{i-1}\quad\Longrightarrow\quad\mathbf{x}(t+\Delta t)=\mathbf{x}(t)-\tau\nabla f(\mathbf{x}(t))+\mathbf{z}(t).$$

Now, let's define a random process $\mathbf{w}(t)$ such that $\mathbf{z}(t)=\mathbf{w}(t+\Delta t)-\mathbf{w}(t)\approx\frac{d\mathbf{w}(t)}{dt}\Delta t$ for a very small $\Delta t$. In computation, we can generate such a $\mathbf{w}(t)$ (which is a Wiener process) by integrating $\mathbf{z}(t)$. With $\mathbf{w}(t)$ defined, we can write

$$\begin{aligned}
\mathbf{x}(t+\Delta t) &= \mathbf{x}(t)-\tau\nabla f(\mathbf{x}(t))+\mathbf{z}(t)\\
\Longrightarrow\quad \mathbf{x}(t+\Delta t)-\mathbf{x}(t) &= -\tau\nabla f(\mathbf{x}(t))+\mathbf{w}(t+\Delta t)-\mathbf{w}(t)\\
\Longrightarrow\quad d\mathbf{x} &= -\tau\nabla f(\mathbf{x})\,dt+d\mathbf{w}.
\end{aligned}$$

The equation above reveals the generic form of the SDE. We summarize it as follows.

Forward Diffusion.
$$d\mathbf{x}=\underbrace{\mathbf{f}(\mathbf{x},t)}_{\text{drift}}\,dt+\underbrace{g(t)}_{\text{diffusion}}\,d\mathbf{w}.$$ (96)

The two terms $\mathbf{f}(\mathbf{x},t)$ and $g(t)$ carry physical meanings. The drift coefficient is a vector-valued function $\mathbf{f}(\mathbf{x},t)$ defining how the molecules in a closed system would move in the absence of random effects. For the gradient descent algorithm, the drift is defined by the negative gradient of the objective function; that is, we want the solution trajectory to follow the gradient of the objective.

The diffusion coefficient $g(t)$ is a scalar function describing how the molecules would randomly walk from one position to another. The function $g(t)$ determines the strength of the random movement.

Example. Consider the equation $d\mathbf{x}=a\,d\mathbf{w}$, where $a=0.05$. The iterative scheme can be written as
$$\mathbf{x}_{i}-\mathbf{x}_{i-1}=a\underbrace{(\mathbf{w}_{i}-\mathbf{w}_{i-1})}_{\overset{\text{def}}{=}\,\mathbf{z}_{i-1}\sim\mathcal{N}(0,\mathbf{I})}\quad\Rightarrow\quad\mathbf{x}_{i}=\mathbf{x}_{i-1}+a\mathbf{z}_{i-1}.$$
We can plot the function $\mathbf{x}_{i}$ as below. The initial point $\mathbf{x}_{0}=0$ is marked in red to indicate that the process is moving forward in time. [Figure: a forward random-walk trajectory starting at $\mathbf{x}_{0}=0$.]
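A minimal simulation of this example is sketched below; the trajectory is a one-dimensional random walk starting at $\mathbf{x}_{0}=0$.

```python
# A minimal simulation of the forward equation dx = a dw from the
# example above, with a = 0.05 and x_0 = 0 as in the text.
import numpy as np

rng = np.random.default_rng(0)
a, N = 0.05, 1000
x = np.zeros(N + 1)                          # x[0] = 0 is the (red) initial point
for i in range(1, N + 1):
    x[i] = x[i - 1] + a * rng.normal()       # x_i = x_{i-1} + a z_{i-1}

print(x[-1])                                 # endpoint of one random-walk trajectory
```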

Remark. As you can see, the differential $d\mathbf{w}=\mathbf{w}_{i}-\mathbf{w}_{i-1}$ is defined through the Wiener process, and it is a white Gaussian vector. It is not the individual $\mathbf{w}_{i}$ that is Gaussian; it is the difference $\mathbf{w}_{i}-\mathbf{w}_{i-1}$ that is Gaussian.

Example. Consider the equation $d\mathbf{x}=-\frac{\alpha}{2}\mathbf{x}\,dt+\beta\,d\mathbf{w}$, where $\alpha=1$ and $\beta=0.1$. This equation can be written as
$$\mathbf{x}_{i}-\mathbf{x}_{i-1}=-\frac{\alpha}{2}\mathbf{x}_{i-1}+\beta\underbrace{(\mathbf{w}_{i}-\mathbf{w}_{i-1})}_{\overset{\text{def}}{=}\,\mathbf{z}_{i-1}\sim\mathcal{N}(0,\mathbf{I})}\quad\Rightarrow\quad\mathbf{x}_{i}=\left(1-\frac{\alpha}{2}\right)\mathbf{x}_{i-1}+\beta\mathbf{z}_{i-1}.$$
We can plot the function $\mathbf{x}_{i}$ as below. [Figure: a trajectory pulled toward zero by the drift, with small random fluctuations.]
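The same kind of simulation with the drift term included is sketched below (the step size is absorbed into $\alpha$ and $\beta$, exactly matching the recursion above; the initial point is our own arbitrary choice).

```python
# A simulation of dx = -(alpha/2) x dt + beta dw via the discrete
# recursion x_i = (1 - alpha/2) x_{i-1} + beta z_{i-1}, with alpha = 1
# and beta = 0.1 as in the text.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, N = 1.0, 0.1, 1000
x = np.empty(N + 1)
x[0] = 3.0                                   # arbitrary initial point
for i in range(1, N + 1):
    x[i] = (1 - alpha / 2) * x[i - 1] + beta * rng.normal()

print(x[-1])                                 # the drift pulls the trajectory toward 0
```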

The reverse direction of the diffusion equation is to move backward in time. The reverse-time SDE, according to Anderson [17], is given as follows.

Reverse SDE.
$$d\mathbf{x}=\left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\right]dt+g(t)\,d\overline{\mathbf{w}},$$ (97)
where $p_{t}(\mathbf{x})$ is the distribution of $\mathbf{x}$ at time $t$, and $d\overline{\mathbf{w}}$ denotes the Wiener process running backward in time.

Example. Consider the reverse diffusion equation
$$d\mathbf{x}=a\,d\overline{\mathbf{w}}.$$ (98)
We can write the discrete-time recursion as follows. For $i=N,N-1,\ldots,1$, do
$$\mathbf{x}_{i-1}=\mathbf{x}_{i}+a\underbrace{(\mathbf{w}_{i-1}-\mathbf{w}_{i})}_{=\,\mathbf{z}_{i}}=\mathbf{x}_{i}+a\mathbf{z}_{i},\qquad\mathbf{z}_{i}\sim\mathcal{N}(0,\mathbf{I}).$$
In the figure below we show the trajectory of this reverse-time process. Note that the initial point, marked in red, is at $\mathbf{x}_{N}$; the process is tracked backward to $\mathbf{x}_{0}$. [Figure: a reverse-time random-walk trajectory from $\mathbf{x}_{N}$ back to $\mathbf{x}_{0}$.]

    4.3 Stochastic Differential Equation for DDPM

To derive the connection between the DDPM and the SDE, we consider the discrete-time DDPM iteration. For $i=1,2,\ldots,N$:

$$\mathbf{x}_{i}=\sqrt{1-\beta_{i}}\,\mathbf{x}_{i-1}+\sqrt{\beta_{i}}\,\mathbf{z}_{i-1},\qquad\mathbf{z}_{i-1}\sim\mathcal{N}(0,\mathbf{I}).$$ (99)

We can show that this equation can be derived from the forward SDE below.

The forward sampling equation of DDPM can be written as an SDE via
$$d\mathbf{x}=\underbrace{-\frac{\beta(t)}{2}\,\mathbf{x}}_{=\,\mathbf{f}(\mathbf{x},t)}\,dt+\underbrace{\sqrt{\beta(t)}}_{=\,g(t)}\,d\mathbf{w}.$$ (100)

To see why this is the case, we define a step size $\Delta t=\tfrac{1}{N}$ and consider an auxiliary noise level $\{\overline{\beta}_{i}\}_{i=1}^{N}$, where $\beta_{i}=\tfrac{\overline{\beta}_{i}}{N}$. Then,

$$\beta_{i}=\underbrace{\beta\left(\tfrac{i}{N}\right)}_{\overline{\beta}_{i}}\cdot\frac{1}{N}=\beta(t+\Delta t)\,\Delta t,$$

where we assume that in the limit $N\rightarrow\infty$, $\overline{\beta}_{i}$ becomes a continuous-time function $\beta(t)$ for $0\leq t\leq 1$. Similarly, we define

$$\mathbf{x}_{i}=\mathbf{x}\left(\tfrac{i}{N}\right)=\mathbf{x}(t+\Delta t),\qquad\mathbf{z}_{i}=\mathbf{z}\left(\tfrac{i}{N}\right)=\mathbf{z}(t+\Delta t).$$

Hence, we have

$$\begin{aligned}
\mathbf{x}_{i} &= \sqrt{1-\beta_{i}}\,\mathbf{x}_{i-1}+\sqrt{\beta_{i}}\,\mathbf{z}_{i-1}\\
\Rightarrow\quad \mathbf{x}_{i} &= \sqrt{1-\tfrac{\overline{\beta}_{i}}{N}}\,\mathbf{x}_{i-1}+\sqrt{\tfrac{\overline{\beta}_{i}}{N}}\,\mathbf{z}_{i-1}\\
\Rightarrow\quad \mathbf{x}(t+\Delta t) &= \sqrt{1-\beta(t+\Delta t)\,\Delta t}\;\mathbf{x}(t)+\sqrt{\beta(t+\Delta t)\,\Delta t}\;\mathbf{z}(t)\\
\Rightarrow\quad \mathbf{x}(t+\Delta t) &\approx \left(1-\frac{1}{2}\beta(t+\Delta t)\,\Delta t\right)\mathbf{x}(t)+\sqrt{\beta(t+\Delta t)\,\Delta t}\;\mathbf{z}(t)\\
\Rightarrow\quad \mathbf{x}(t+\Delta t) &\approx \mathbf{x}(t)-\frac{1}{2}\beta(t)\Delta t\;\mathbf{x}(t)+\sqrt{\beta(t)\,\Delta t}\;\mathbf{z}(t).
\end{aligned}$$

Thus, as $\Delta t\rightarrow 0$, we have

$$d\mathbf{x}=-\frac{1}{2}\beta(t)\,\mathbf{x}\,dt+\sqrt{\beta(t)}\;d\mathbf{w}.$$ (101)

Therefore, we have shown that the DDPM forward update iteration can be equivalently written as an SDE.

Being able to write the DDPM forward update iteration as an SDE means that the DDPM estimates can be determined by solving the SDE. In other words, for an appropriately defined SDE solver, we can throw the SDE into the solver, and the solution it returns will be the DDPM estimate. Of course, we do not need an SDE solver, because the DDPM iteration itself is already solving the SDE. It may not be the best SDE solver, since the DDPM iteration is only a first-order method. Nevertheless, if we are not interested in using an SDE solver, we can still run the DDPM iterations to obtain a solution. The following example illustrates this.

Example. Consider the DDPM forward equation with $\beta_{i}=0.05$ for all $i=0,\ldots,N-1$. We initialize the sample $\mathbf{x}_{0}$ by drawing it from a Gaussian mixture, such that
$$\mathbf{x}_{0}\sim\sum_{k=1}^{K}\pi_{k}\,\mathcal{N}(\mathbf{x}_{0}\,|\,\boldsymbol{\mu}_{k},\sigma_{k}^{2}\mathbf{I}),$$
where $\pi_{1}=\pi_{2}=0.5$, $\sigma_{1}=\sigma_{2}=1$, $\boldsymbol{\mu}_{1}=3$, and $\boldsymbol{\mu}_{2}=-3$. Then, using the equation
$$\mathbf{x}_{i}=\sqrt{1-\beta_{i}}\,\mathbf{x}_{i-1}+\sqrt{\beta_{i}}\,\mathbf{z}_{i-1},\qquad\mathbf{z}_{i-1}\sim\mathcal{N}(0,\mathbf{I}),$$
we can plot the trajectory and the distribution as follows. [Figure: forward DDPM trajectories and the evolution of the bimodal distribution toward a Gaussian.]
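A sketch of this experiment is given below: we draw a population of samples $\mathbf{x}_{0}$ from the two-component mixture and push them through Eqn (99); the empirical statistics approach those of $\mathcal{N}(0,1)$.

```python
# A sketch of the DDPM forward example: x_0 drawn from the two-component
# Gaussian mixture in the text, then diffused with Eqn (99), beta_i = 0.05.
import numpy as np

rng = np.random.default_rng(0)
beta, N, M = 0.05, 500, 2000                 # noise level, steps, number of samples

# x_0 ~ 0.5 N(3, 1) + 0.5 N(-3, 1), as in the example
mu = np.where(rng.random(M) < 0.5, 3.0, -3.0)
x = mu + rng.normal(size=M)

for i in range(N):                           # Eqn (99)
    x = np.sqrt(1 - beta) * x + np.sqrt(beta) * rng.normal(size=M)

print(x.mean(), x.var())                     # approaches N(0, 1) statistics
```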

The reverse diffusion equation follows from Eqn (97) by substituting the appropriate quantities: $\mathbf{f}(\mathbf{x},t)=-\frac{\beta(t)}{2}\mathbf{x}$ and $g(t)=\sqrt{\beta(t)}$. This gives us

$$\begin{aligned}
d\mathbf{x} &= \left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\right]dt+g(t)\,d\overline{\mathbf{w}}\\
&= \left[-\frac{\beta(t)}{2}\,\mathbf{x}-\beta(t)\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\right]dt+\sqrt{\beta(t)}\,d\overline{\mathbf{w}},
\end{aligned}$$

which gives us the following equation.

The reverse sampling equation of DDPM can be written as an SDE via
$$d\mathbf{x}=-\beta(t)\left[\frac{\mathbf{x}}{2}+\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\right]dt+\sqrt{\beta(t)}\,d\overline{\mathbf{w}}.$$ (102)

The iterative update scheme can be written by considering $d\mathbf{x}=\mathbf{x}(t)-\mathbf{x}(t-\Delta t)$ and $d\overline{\mathbf{w}}=\mathbf{w}(t-\Delta t)-\mathbf{w}(t)=-\mathbf{z}(t)$. Then, letting $dt=\Delta t$, we can show that

$$\begin{aligned}
\mathbf{x}(t)-\mathbf{x}(t-\Delta t) &= -\beta(t)\Delta t\left[\frac{\mathbf{x}(t)}{2}+\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))\right]-\sqrt{\beta(t)\Delta t}\,\mathbf{z}(t)\\
\Rightarrow\quad \mathbf{x}(t-\Delta t) &= \mathbf{x}(t)+\beta(t)\Delta t\left[\frac{\mathbf{x}(t)}{2}+\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))\right]+\sqrt{\beta(t)\Delta t}\,\mathbf{z}(t).
\end{aligned}$$

By grouping the terms, and assuming that $\beta(t)\Delta t\ll 1$, we recognize that

$$\begin{aligned}
\mathbf{x}(t-\Delta t) &= \mathbf{x}(t)\left[1+\frac{\beta(t)\Delta t}{2}\right]+\beta(t)\Delta t\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))+\sqrt{\beta(t)\Delta t}\,\mathbf{z}(t)\\
&\approx \mathbf{x}(t)\left[1+\frac{\beta(t)\Delta t}{2}\right]+\beta(t)\Delta t\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))+\frac{(\beta(t)\Delta t)^{2}}{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))+\sqrt{\beta(t)\Delta t}\,\mathbf{z}(t)\\
&= \left[1+\frac{\beta(t)\Delta t}{2}\right]\Big(\mathbf{x}(t)+\beta(t)\Delta t\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))\Big)+\sqrt{\beta(t)\Delta t}\,\mathbf{z}(t).
\end{aligned}$$

Then, following the discretization scheme by letting $t\in\{0,\ldots,\tfrac{N-1}{N}\}$, $\Delta t=1/N$, $\mathbf{x}(t-\Delta t)=\mathbf{x}_{i-1}$, $\mathbf{x}(t)=\mathbf{x}_{i}$, and $\beta(t)\Delta t=\beta_{i}$, we can show that

$$\begin{aligned}
\mathbf{x}_{i-1} &= \left(1+\tfrac{\beta_{i}}{2}\right)\Big[\mathbf{x}_{i}+\tfrac{\beta_{i}}{2}\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})\Big]+\sqrt{\beta_{i}}\,\mathbf{z}_{i}\\
&\approx \frac{1}{\sqrt{1-\beta_{i}}}\Big[\mathbf{x}_{i}+\tfrac{\beta_{i}}{2}\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})\Big]+\sqrt{\beta_{i}}\,\mathbf{z}_{i},
\end{aligned}$$ (103)

where $p_{i}(\mathbf{x})$ is the probability density function of $\mathbf{x}$ at time $i$. For practical implementations, we can replace $\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})$ by the estimated score function $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i})$.

Therefore, we have recovered the DDPM iteration, consistent with the one defined by Song and Ermon in [8]. This is an interesting result, because it allows us to connect the DDPM iteration with the score function. Song and Ermon [8] called this SDE the variance preserving (VP) SDE.

Example. Following from the previous example, we perform the reverse diffusion using
$$\mathbf{x}_{i-1}=\frac{1}{\sqrt{1-\beta_{i}}}\Big[\mathbf{x}_{i}+\tfrac{\beta_{i}}{2}\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})\Big]+\sqrt{\beta_{i}}\,\mathbf{z}_{i},$$
where $\mathbf{z}_{i}\sim\mathcal{N}(0,\mathbf{I})$. The trajectory of the iterates is shown below. [Figure: reverse DDPM trajectories recovering the bimodal distribution.]
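A sketch of this reverse experiment is given below. Because $\sigma_{1}=\sigma_{2}=1$, the marginal $p_{i}$ remains a two-component unit-variance Gaussian mixture with means $\pm 3\sqrt{\overline{\alpha}_{i}}$, where $\overline{\alpha}_{i}=\prod_{j\leq i}(1-\beta_{j})$, so the exact score is available in closed form and stands in for a learned $\mathbf{s}_{\boldsymbol{\theta}}$.

```python
# A sketch of the reverse DDPM iteration (Eqn 103) for the mixture example.
# Since sigma_k = 1, the marginal p_i is a unit-variance two-component
# Gaussian mixture with means +/- 3*sqrt(alpha_bar_i); its exact score
# stands in for a learned score network.
import numpy as np

rng = np.random.default_rng(0)
beta, N, M = 0.05, 500, 2000
alpha_bar = (1 - beta) ** np.arange(1, N + 1)   # cumulative product for constant beta

def score(x, abar):
    m = 3.0 * np.sqrt(abar)                     # component means at this step
    w1 = np.exp(-0.5 * (x - m) ** 2)            # unnormalized component weights
    w2 = np.exp(-0.5 * (x + m) ** 2)
    return (w1 * (m - x) + w2 * (-m - x)) / (w1 + w2)

x = rng.normal(size=M)                          # start from N(0, I)
for i in range(N - 1, -1, -1):                  # i = N, ..., 1 in the text's indexing
    z = rng.normal(size=M)
    x = (x + 0.5 * beta * score(x, alpha_bar[i])) / np.sqrt(1 - beta) \
        + np.sqrt(beta) * z                     # Eqn (103)

print(x.mean(), x.var())                        # samples split back toward +/- 3
```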

    4.4 Stochastic Differential Equation for SMLD

The score-matching Langevin dynamics (SMLD) model can also be described by an SDE. To begin with, we notice that in the SMLD setting there is not really a "forward diffusion step". However, if we divide the noise scale in the SMLD training into $N$ levels, we can argue, roughly, that the recursion should follow a Markov chain:

$$\mathbf{x}_{i}=\mathbf{x}_{i-1}+\sqrt{\sigma_{i}^{2}-\sigma_{i-1}^{2}}\,\mathbf{z}_{i-1},\qquad i=1,2,\ldots,N.$$ (104)

This is not hard to see. If we assume that the variance of $\mathbf{x}_{i-1}$ is $\sigma_{i-1}^{2}$, then we can show that

$$\begin{aligned}
\mathrm{Var}[\mathbf{x}_{i}] &= \mathrm{Var}[\mathbf{x}_{i-1}]+(\sigma_{i}^{2}-\sigma_{i-1}^{2})\\
&= \sigma_{i-1}^{2}+(\sigma_{i}^{2}-\sigma_{i-1}^{2})=\sigma_{i}^{2}.
\end{aligned}$$

Therefore, given a sequence of noise levels, Eqn (104) will indeed generate estimates $\mathbf{x}_{i}$ whose noise statistics satisfy the desired properties.
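A quick numerical check of this variance claim (with a hypothetical noise schedule of our own choosing) is sketched below.

```python
# Numerical check of Eqn (104): starting from Var[x_0] = sigma_0^2, the
# recursion produces x_N whose variance matches sigma_N^2.
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 100000
sigma = np.linspace(0.1, 1.0, N + 1)         # hypothetical schedule sigma_0, ..., sigma_N

x = rng.normal(size=M) * sigma[0]            # Var[x_0] = sigma_0^2
for i in range(1, N + 1):                    # Eqn (104)
    x = x + np.sqrt(sigma[i] ** 2 - sigma[i - 1] ** 2) * rng.normal(size=M)

print(x.var(), sigma[-1] ** 2)               # empirical variance ~ sigma_N^2
```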

If we accept Eqn (104), it is easy to derive the SDE associated with it. Assuming that in the limit $\{\sigma_{i}\}_{i=1}^{N}$ becomes the continuous-time function $\sigma(t)$ for $0\leq t\leq 1$, and $\{\mathbf{x}_{i}\}_{i=1}^{N}$ becomes $\mathbf{x}(t)$, where $\mathbf{x}_{i}=\mathbf{x}(\tfrac{i}{N})$ if we let $t\in\{0,\tfrac{1}{N},\ldots,\tfrac{N-1}{N}\}$, then we have

$$\begin{aligned}
\mathbf{x}(t+\Delta t) &= \mathbf{x}(t)+\sqrt{\sigma(t+\Delta t)^{2}-\sigma(t)^{2}}\,\mathbf{z}(t)\\
&\approx \mathbf{x}(t)+\sqrt{\frac{d[\sigma(t)^{2}]}{dt}\Delta t}\;\mathbf{z}(t).
\end{aligned}$$

At the limit when $\Delta t\rightarrow 0$, the equation converges to

$$d\mathbf{x}=\sqrt{\frac{d[\sigma(t)^{2}]}{dt}}\;d\mathbf{w}.$$

We summarize our result as follows.

The forward sampling equation of SMLD can be written as an SDE via
$$d\mathbf{x}=\sqrt{\frac{d[\sigma(t)^{2}]}{dt}}\;d\mathbf{w}.$$ (105)
Mapping this to Eqn (96), we recognize that

$$\mathbf{f}(\mathbf{x},t)=0,\qquad\text{and}\qquad g(t)=\sqrt{\frac{d[\sigma(t)^{2}]}{dt}}.$$

As a result, if we write the reverse equation Eqn (97), we should have

$$\begin{aligned}
d\mathbf{x} &= \left[\mathbf{f}(\mathbf{x},t)-g(t)^{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\right]dt+g(t)\,d\overline{\mathbf{w}}\\
&= -\left(\frac{d[\sigma(t)^{2}]}{dt}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))\right)dt+\sqrt{\frac{d[\sigma(t)^{2}]}{dt}}\;d\overline{\mathbf{w}}.
\end{aligned}$$

This gives us the following reverse equation.

The reverse sampling equation of SMLD can be written as an SDE via
$$d\mathbf{x}=-\left(\frac{d[\sigma(t)^{2}]}{dt}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}(t))\right)dt+\sqrt{\frac{d[\sigma(t)^{2}]}{dt}}\;d\overline{\mathbf{w}}.$$ (106)
For the discrete-time iterations, we first define $\alpha(t)=\frac{d[\sigma(t)^{2}]}{dt}$. Then, using the same set of discretization setups as in the DDPM case, we can show that

$$\begin{aligned}
\mathbf{x}(t+\Delta t)-\mathbf{x}(t) &= -\Big(\alpha(t)\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})\Big)\Delta t-\sqrt{\alpha(t)\Delta t}\;\mathbf{z}(t)\\
\Rightarrow\quad \mathbf{x}(t) &= \mathbf{x}(t+\Delta t)+\alpha(t)\Delta t\,\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x})+\sqrt{\alpha(t)\Delta t}\;\mathbf{z}(t)\\
\Rightarrow\quad \mathbf{x}_{i-1} &= \mathbf{x}_{i}+\alpha_{i}\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})+\sqrt{\alpha_{i}}\;\mathbf{z}_{i}\\
\Rightarrow\quad \mathbf{x}_{i-1} &= \mathbf{x}_{i}+(\sigma_{i}^{2}-\sigma_{i-1}^{2})\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})+\sqrt{\sigma_{i}^{2}-\sigma_{i-1}^{2}}\;\mathbf{z}_{i},
\end{aligned}$$ (107)

which is identical to the SMLD reverse update equation. Song and Ermon [8] called this SDE the variance exploding (VE) SDE.
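To make Eqn (107) concrete, here is a minimal numerical sketch of the discrete VE reverse update. Everything problem-specific is an assumption of ours: the noise schedule sigmas is an arbitrary increasing sequence, and the score is the exact score of a toy problem where $p(\mathbf{x})=\mathcal{N}(0,\mathbf{I})$, so that $p_{i}=\mathcal{N}(0,(1+\sigma_{i}^{2})\mathbf{I})$ is available in closed form; in practice a trained estimate $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},\sigma_{i})$ would be used instead.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 500
    # assumed VE noise schedule: sigma_0 < sigma_1 < ... < sigma_N
    sigmas = np.linspace(1e-2, 10.0, N + 1)

    def score(x, sigma):
        # exact score of p_i = N(0, (1 + sigma^2) I) in our toy setting;
        # a learned network s_theta(x, sigma) in practice
        return -x / (1.0 + sigma**2)

    x = sigmas[-1] * rng.standard_normal(2)      # x_N ~ N(0, sigma_N^2 I)
    for i in range(N, 0, -1):
        dv = sigmas[i]**2 - sigmas[i - 1]**2     # sigma_i^2 - sigma_{i-1}^2
        z = rng.standard_normal(2)
        # Eqn (107): x_{i-1} = x_i + dv * score + sqrt(dv) * z
        x = x + dv * score(x, sigmas[i]) + np.sqrt(dv) * z
    print(x)                                     # roughly a draw from N(0, I)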

4.5 Solving the SDE

In this subsection, we briefly discuss how differential equations can be solved numerically. To make the discussion a little easier, we will focus on ODEs. Consider the following ODE:

$$\frac{d\mathbf{x}(t)}{dt}=\mathbf{f}(\mathbf{x}(t),t). \tag{108}$$

If the ODE is a scalar ODE, then it reads $\frac{dx(t)}{dt}=f(x(t),t)$.

Euler Method. The Euler method is a first-order numerical method for solving ODEs. Given $\frac{dx(t)}{dt}=f(x(t),t)$ and $x(t_{0})=x_{0}$, the Euler method solves the problem via the iterative scheme

$$x_{i+1}=x_{i}+\alpha\cdot f(x_{i},t_{i}),\qquad i=0,1,\ldots,N-1,$$

where $\alpha$ is the step size. Let us consider a simple example.

Example. [18, Example 2.2] Consider the following ODE:
$$\frac{dx(t)}{dt}=\frac{x(t)+t^{2}-2}{t+1}.$$
If we apply the Euler method with a step size $\alpha$, then the iteration will take the form
$$x_{i+1}=x_{i}+\alpha\cdot f(x_{i},t_{i})=x_{i}+\alpha\cdot\frac{x_{i}+t_{i}^{2}-2}{t_{i}+1}.$$
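A minimal sketch of this Euler iteration, assuming the initial condition $x(0)=2$ for illustration (the example above does not fix one). With that choice, the exact solution is $x(t)=t^{2}+2t+2-2(t+1)\ln(t+1)$, which lets us see the first-order error of the scheme.

    import numpy as np

    def f(x, t):
        return (x + t**2 - 2) / (t + 1)

    alpha = 0.05                        # step size
    t, x = 0.0, 2.0                     # assumed initial condition x(0) = 2
    for _ in range(int(1.0 / alpha)):   # integrate from t = 0 to t = 1
        x = x + alpha * f(x, t)         # Euler update x_{i+1} = x_i + alpha f(x_i, t_i)
        t = t + alpha

    exact = t**2 + 2*t + 2 - 2*(t + 1)*np.log(t + 1)
    print(x, exact)                     # Euler estimate vs exact value at t = 1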

Runge-Kutta (RK) Method. Another popular method for solving ODEs is the Runge-Kutta (RK) method. The classical RK-4 algorithm solves the ODE via the iteration

$$x_{i+1}=x_{i}+\frac{\alpha}{6}\left(k_{1}+2k_{2}+2k_{3}+k_{4}\right),\qquad i=1,2,\ldots,N,$$

where the quantities $k_{1}$, $k_{2}$, $k_{3}$ and $k_{4}$ are defined as

$$\begin{aligned}
k_{1} &= f(x_{i},\,t_{i}),\\
k_{2} &= f\!\left(x_{i}+\alpha\tfrac{k_{1}}{2},\; t_{i}+\tfrac{\alpha}{2}\right),\\
k_{3} &= f\!\left(x_{i}+\alpha\tfrac{k_{2}}{2},\; t_{i}+\tfrac{\alpha}{2}\right),\\
k_{4} &= f\!\left(x_{i}+\alpha k_{3},\; t_{i}+\alpha\right).
\end{aligned}$$

Readers interested in the details can consult a standard textbook on numerical methods such as [18].
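For comparison, here is a sketch of the RK-4 iteration on the same toy ODE, with the same assumed initial condition $x(0)=2$ as in the Euler sketch; at the same step size, the fourth-order scheme is dramatically more accurate.

    import numpy as np

    def f(x, t):
        return (x + t**2 - 2) / (t + 1)

    def rk4_step(x, t, alpha):
        k1 = f(x, t)
        k2 = f(x + alpha * k1 / 2, t + alpha / 2)
        k3 = f(x + alpha * k2 / 2, t + alpha / 2)
        k4 = f(x + alpha * k3, t + alpha)
        return x + (alpha / 6) * (k1 + 2*k2 + 2*k3 + k4)

    alpha, t, x = 0.05, 0.0, 2.0        # same assumed setup as the Euler sketch
    for _ in range(int(1.0 / alpha)):
        x = rk4_step(x, t, alpha)
        t = t + alpha

    exact = t**2 + 2*t + 2 - 2*(t + 1)*np.log(t + 1)
    print(x, exact)                     # RK-4 is far closer to the exact value than Euler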

Predictor-Corrector Algorithm. Because different numerical solvers behave differently in terms of approximation error, throwing the ODE (or SDE) at an off-the-shelf numerical solver will result in varying degrees of error [19]. However, if our goal is specifically to solve the reverse diffusion equation, it is possible to employ techniques beyond generic numerical ODE/SDE solvers to make appropriate corrections, as illustrated in Figure 22.

Figure 22: Prediction and correction algorithm.

Let us use DDPM as an example. In DDPM, the reverse diffusion equation is given by

$$\mathbf{x}_{i-1}=\frac{1}{\sqrt{1-\beta_{i}}}\left[\mathbf{x}_{i}+\frac{\beta_{i}}{2}\nabla_{\mathbf{x}}\log p_{i}(\mathbf{x}_{i})\right]+\sqrt{\beta_{i}}\,\mathbf{z}_{i}.$$

We can consider this as the Euler method for the reverse diffusion. However, if we have already trained the score function $\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},i)$, we can instead run the score-matching equation, i.e.,

$$\mathbf{x}_{i-1}=\mathbf{x}_{i}+\epsilon_{i}\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},i)+\sqrt{2\epsilon_{i}}\,\mathbf{z}_{i},$$

for $M$ times to make the correction. Algorithm 1 summarizes the idea. (Note that we have replaced the score function by its estimate.)

Algorithm 1 Predictor-Corrector Algorithm for DDPM.
  $\mathbf{x}_{N}\sim\mathcal{N}(0,\mathbf{I})$.
  for $i=N-1,\ldots,0$ do
    $$(\text{Prediction})\qquad\mathbf{x}_{i-1}=\frac{1}{\sqrt{1-\beta_{i}}}\left[\mathbf{x}_{i}+\frac{\beta_{i}}{2}\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},i)\right]+\sqrt{\beta_{i}}\,\mathbf{z}_{i}. \tag{109}$$
    for $m=1,\ldots,M$ do
      $$(\text{Correction})\qquad\mathbf{x}_{i-1}=\mathbf{x}_{i}+\epsilon_{i}\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},i)+\sqrt{2\epsilon_{i}}\,\mathbf{z}_{i}, \tag{110}$$
    end for
  end for
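A minimal sketch of Algorithm 1 follows, with stand-ins for everything the text leaves unspecified: the schedule $\beta_{i}$ and the corrector step size $\epsilon_{i}$ are our own illustrative choices, and the score network is replaced by the exact score of a toy problem where $p(\mathbf{x})=\mathcal{N}(0,\mathbf{I})$ (under the variance-preserving forward process, every marginal $p_{i}$ is then also $\mathcal{N}(0,\mathbf{I})$, so the score is simply $-\mathbf{x}$).

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, d = 1000, 3, 2
    betas = np.linspace(1e-4, 0.02, N + 1)   # assumed DDPM beta schedule

    def s_theta(x, i):
        # exact score in the toy setting; a trained network in practice
        return -x

    x = rng.standard_normal(d)               # x_N ~ N(0, I)
    for i in range(N, 0, -1):
        # (Prediction), Eqn (109)
        z = rng.standard_normal(d)
        x = (x + 0.5 * betas[i] * s_theta(x, i)) / np.sqrt(1 - betas[i]) \
            + np.sqrt(betas[i]) * z
        # (Correction), Eqn (110), repeated M times
        eps = betas[i]                       # assumed corrector step size
        for _ in range(M):
            z = rng.standard_normal(d)
            x = x + eps * s_theta(x, i) + np.sqrt(2 * eps) * z
    print(x)                                 # roughly a draw from N(0, I) in this toy setting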

For the SMLD algorithm, the two equations are:

$$\begin{aligned}
\mathbf{x}_{i-1} &= \mathbf{x}_{i}+(\sigma_{i}^{2}-\sigma_{i-1}^{2})\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},\sigma_{i})+\sqrt{\sigma_{i}^{2}-\sigma_{i-1}^{2}}\;\mathbf{z} &&\text{Prediction},\\
\mathbf{x}_{i-1} &= \mathbf{x}_{i}+\epsilon_{i}\,\mathbf{s}_{\boldsymbol{\theta}}(\mathbf{x}_{i},\sigma_{i})+\sqrt{2\epsilon_{i}}\;\mathbf{z} &&\text{Correction}.
\end{aligned}$$

As in the predictor-corrector algorithm for DDPM, we can pair them up by repeating the correction iteration a few times. A sketch in the same toy setting is given below.
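This sketch only swaps the prediction step relative to the DDPM version; the toy score and the corrector step size $\epsilon_{i}$ are again our own assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    N, M = 500, 3
    sigmas = np.linspace(1e-2, 10.0, N + 1)   # assumed VE noise schedule

    def s_theta(x, sigma):
        return -x / (1.0 + sigma**2)          # exact score when p(x) = N(0, I)

    x = sigmas[-1] * rng.standard_normal(2)
    for i in range(N, 0, -1):
        dv = sigmas[i]**2 - sigmas[i - 1]**2
        # Prediction
        x = x + dv * s_theta(x, sigmas[i]) + np.sqrt(dv) * rng.standard_normal(2)
        # Correction, repeated M times
        eps = 0.1 * dv                        # assumed corrector step size
        for _ in range(M):
            x = x + eps * s_theta(x, sigmas[i - 1]) + np.sqrt(2 * eps) * rng.standard_normal(2)
    print(x)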

Accelerate the SDE Solver. While generic ODE solvers can be used to solve the ODE, the forward and reverse diffusion equations we encounter are quite special. In fact, they take the form

$$\frac{d\mathbf{x}(t)}{dt}=\mathbf{a}(t)\mathbf{x}(t)+\mathbf{b}(t),\qquad\mathbf{x}(t_{0})=\mathbf{x}_{0}, \tag{111}$$

for some choice of functions $\mathbf{a}(t)$ and $\mathbf{b}(t)$, with the initial condition $\mathbf{x}(t_{0})=\mathbf{x}_{0}$. This is not a complicated ODE; it is just a first-order ODE. In [20], Lu et al. observed that because of the special structure of the ODE (which they call the semi-linear structure), it is possible to handle $\mathbf{a}(t)\mathbf{x}(t)$ and $\mathbf{b}(t)$ separately. To understand how things work, we use a textbook result shown below.

Theorem [Variation of Constants] ([21, Theorem 1.2.3]). Consider the ODE over the range $[s,t]$:
$$\frac{dx(t)}{dt}=a(t)x(t)+b(t),\qquad\text{where}\;\;x(t_{0})=x_{0}. \tag{112}$$
The solution is given by
$$x(t)=x_{0}e^{A(t)}+e^{A(t)}\int_{t_{0}}^{t}e^{-A(\tau)}b(\tau)\,d\tau, \tag{113}$$
where $A(t)=\int_{t_{0}}^{t}a(\tau)\,d\tau$.

We can further simplify the second term above by noticing that

$$e^{A(t)-A(\tau)}=e^{\int_{t_{0}}^{t}a(r)\,dr-\int_{t_{0}}^{\tau}a(r)\,dr}=e^{\int_{\tau}^{t}a(r)\,dr}.$$
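As a quick sanity check of Eqn (113), the following sketch compares the closed-form solution against a brute-force Euler integration. The choices $a(t)=-1$ and $b(t)=\sin(t)$ are our own assumptions, made so that $A(t)=-(t-t_{0})$ is available in closed form.

    import numpy as np

    t0, x0, T = 0.0, 1.0, 2.0
    a = -1.0                                   # a(t) = -1, so A(t) = a * (t - t0)
    b = np.sin                                 # b(t) = sin(t)

    # Eqn (113): x(T) = x0 e^{A(T)} + e^{A(T)} * integral of e^{-A(tau)} b(tau)
    tau = np.linspace(t0, T, 100001)
    A = a * (tau - t0)
    x_closed = x0 * np.exp(A[-1]) + np.exp(A[-1]) * np.trapz(np.exp(-A) * b(tau), tau)

    # brute-force Euler on dx/dt = a x + b(t), as an independent check
    x, dt = x0, 1e-4
    for t in np.arange(t0, T, dt):
        x = x + dt * (a * x + b(t))

    print(x_closed, x)                         # the two agree to a few decimal places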

The starting point of [20] was the reverse diffusion equation derived from [8]:

$$\frac{d\mathbf{x}(t)}{dt}=f(t)\mathbf{x}(t)+\frac{g^{2}(t)}{2\sigma(t)}\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}(t),t),\qquad\mathbf{x}(T)\sim\mathcal{N}(0,\widetilde{\sigma}^{2}\mathbf{I}),$$

where $f(t)=\frac{d\log\alpha(t)}{dt}$, and $g^{2}(t)=\frac{d\sigma(t)^{2}}{dt}-2\frac{d\log\alpha(t)}{dt}\sigma(t)^{2}$. Using the Variation of Constants Theorem, we can solve the ODE exactly at time $t$ by the formula

$$\mathbf{x}(t)=e^{\int_{s}^{t}f(\tau)\,d\tau}\,\mathbf{x}(s)+\int_{s}^{t}\left(e^{\int_{\tau}^{t}f(r)\,dr}\,\frac{g^{2}(\tau)}{2\sigma(\tau)}\,\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}(\tau),\tau)\right)d\tau.$$

Then, by defining $\lambda_{t}=\log\big(\alpha(t)/\sigma(t)\big)$, and with additional simplifications outlined in [20], this equation can be simplified to

$$\mathbf{x}(t)=\frac{\alpha(t)}{\alpha(s)}\mathbf{x}(s)-\alpha(t)\int_{s}^{t}\left(\frac{d\lambda_{\tau}}{d\tau}\right)\frac{\sigma(\tau)}{\alpha(\tau)}\,\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}(\tau),\tau)\,d\tau.$$

To evaluate this equation, all we need to do is to run a numerical integrator for the integral on the right-hand side. There are, of course, other numerical acceleration methods for solving the ODE, which we skip for brevity. A first-order sketch of the idea is given below.
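Freezing $\boldsymbol{\epsilon}_{\boldsymbol{\theta}}$ at $\mathbf{x}(s)$ over $[s,t]$ and integrating the remaining exponential exactly gives the step $\mathbf{x}_{t}=\frac{\alpha(t)}{\alpha(s)}\mathbf{x}_{s}-\sigma(t)\,(e^{h}-1)\,\boldsymbol{\epsilon}_{\boldsymbol{\theta}}(\mathbf{x}_{s},s)$ with $h=\lambda_{t}-\lambda_{s}$, which is the first-order solver in [20]. In the sketch below, the cosine schedule and the noise predictor are our own placeholders; the predictor is exact for a toy data distribution $p(\mathbf{x}_{0})=\mathcal{N}(\boldsymbol{\mu},\mathbf{I})$ under a variance-preserving schedule.

    import numpy as np

    # assumed VP cosine schedule on u in (0, 1), so alpha^2 + sigma^2 = 1
    alpha = lambda u: np.cos(0.5 * np.pi * u)
    sigma = lambda u: np.sin(0.5 * np.pi * u)
    lam   = lambda u: np.log(alpha(u) / sigma(u))     # lambda_u = log(alpha/sigma)

    mu = np.array([2.0, -1.0])                        # toy data mean, p(x0) = N(mu, I)

    def eps_theta(x, u):
        # exact noise predictor in the toy setting, since x_u ~ N(alpha(u) mu, I);
        # a trained network in practice
        return sigma(u) * (x - alpha(u) * mu)

    def exp_int_step(x, s, t):
        # first-order exponential-integrator step from time s down to t < s
        h = lam(t) - lam(s)
        return (alpha(t) / alpha(s)) * x - sigma(t) * np.expm1(h) * eps_theta(x, s)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(2)                        # x(s) ~ N(0, I) at s near 1
    times = np.linspace(0.99, 0.01, 50)
    for s, t in zip(times[:-1], times[1:]):
        x = exp_int_step(x, s, t)
    print(x)                                          # approximately a draw from N(mu, I)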

Congratulations! We are done. This is all we have to say about the SDEs.

Some of you may wonder: why do we want to map the iterative schemes to differential equations in the first place? There are a few reasons; some are legitimate while others are speculative.

• By unifying the different diffusion models under the same SDE framework, we can compare the algorithms. In some cases, we can borrow ideas from the SDE literature as well as the stochastic sampling literature to improve the numerical schemes. For example, the predictor-corrector scheme in [8] was a hybrid SDE solver coupled with Markov chain Monte Carlo.

• According to some papers such as [22], mapping the diffusion iterations to an SDE offers more design flexibility.

• Outside the context of diffusion algorithms, generic stochastic gradient descent algorithms have corresponding continuous-time SDE limits, whose marginal densities evolve according to the Fokker-Planck equation. People have demonstrated how to theoretically characterize the limiting distribution of the estimates in closed form. This mitigates the difficulty of analyzing a randomized algorithm by instead analyzing its well-defined limiting distribution.

    5 Conclusion

This tutorial has covered a few of the basic concepts underpinning the development of diffusion-based generative models in the recent literature. Given the sheer (and rapidly expanding) volume of the literature, we find it particularly important to explain the fundamental ideas instead of recycling Python demos. Below are a few lessons we learned while writing this tutorial:

• The same diffusion idea can be derived independently from multiple perspectives: VAE, DDPM, SMLD, and SDE. Although some may argue otherwise, there is no particular reason why one is superior or inferior to the others.

• The main reason why denoising diffusion works is its use of small incremental steps, something that was not realized in the era of GANs and VAEs.

• While iterative denoising is the current state of the art, the approach itself does not appear to be the ultimate solution; humans do not generate images from pure noise. Moreover, because of the incremental nature of diffusion models, speed will continue to be a major obstacle, although there have been some efforts on knowledge distillation to improve the situation.

• Some questions about generating noise from non-Gaussian distributions may require justification. If the whole reason for introducing the Gaussian distribution is to make the derivations easier, why should we make our lives harder by switching to other types of noise?

• The application of diffusion models to inverse problems is readily within reach. For existing inverse-problem solvers such as the Plug-and-Play ADMM algorithm, we can replace the denoiser with an explicit diffusion sampler. People have demonstrated improved image restoration results based on this approach.

    References